Libraries for next generation sequencing

ABSTRACT

Provided herein are compositions and methods for Next Generation Sequencing. Further provided herein are compositions and methods for uniquely labeling molecules. Further provided herein are compositions and methods for synthesizing unique molecular identifiers.

CROSS-REFERENCE

This application claims the benefit of U.S. provisional patent application No. 63/105,824 filed on Oct. 26, 2020, which is incorporated herein by reference in its entirety.

BACKGROUND

Nucleic acid sequencing with high fidelity and low cost has a central role in biotechnology and medicine, and in basic biomedical research. While various methods are known for sequencing complex nucleic acid samples, these techniques often suffer from limitations in scalability, automation, speed, accuracy, and cost.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF SUMMARY

Provided herein are compositions and methods for uniquely labeling polynucleotides.

Provided herein are methods of sequencing comprising: ligating one or more polynucleotide adapters to a plurality of sample nucleic acids to generate a library of adapter-ligated sample polynucleotides, wherein at least some of the polynucleotide adapters comprise a first unique molecular identifier and a second unique molecular identifier; amplifying the library; sequencing the enriched library to generate a plurality of reads; organizing the reads based on the first unique molecular identifier and the second unique molecular identifier to distinguish between amplification errors and single nucleotide polymorphisms present in the sample nucleic acids, and wherein at least 80% of SNV variants are called at a level of at least 1%. Further provided herein are methods wherein the SNV variants are called with a minimum sequencing depth of 10,000×. Further provided herein are methods wherein at least 90% of SNV variants are called at a level of at least 1%. Further provided herein are methods wherein at least 95% of SNV variants are called at a level of at least 1%. Further provided herein are methods wherein at least 95% of SNV variants are called at a level of at least 0.5%. Further provided herein are methods wherein at least 95% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 10,000×. Further provided herein are methods wherein at least 95% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 20,000×. Further provided herein are methods wherein the first unique molecular identifier and the second unique molecular identifier are selected from a set of no more than 64 sequences. Further provided herein are methods wherein the first unique molecular identifier and the second unique molecular identifier are selected from a set of no more than 48 sequences. Further provided herein are methods wherein the first unique molecular identifier and the second unique molecular identifier are selected from a set of sequences having a Hamming distance of at least 2. Further provided herein are methods wherein the duplex efficiency comprises the number of duplex reads after duplex collapse divided by the total number of input reads. Further provided herein are methods wherein the polynucleotide adapter comprises a duplex efficiency of at least 4%. Further provided herein are methods wherein the polynucleotide adapter comprises a duplex efficiency of at least 6%. Further provided herein are methods wherein the method has a recall of at least 20% for single nucleotide polymorphisms present at least at 0.2% abundance in the sample nucleic acids. Further provided herein are methods wherein the single nucleotide polymorphisms are present at less than 1% abundance in the sample nucleic acids.

Provided herein are libraries of polynucleotide adapters comprising: at least two polynucleotide adapters, each comprising: a first strand, wherein the first strand comprises a first terminal adapter region, a first non-complementary region, a first yoke region, and a first unique molecular identifier; and a second strand, wherein the second strand comprises a second terminal adapter region, a second non-complementary region, a second yoke region, and a second unique molecular identifier; wherein the first yoke region and the second yoke region are complementary, wherein the first non-complementary region and the second non-complementary region are not complementary, and wherein the fraction of each of the polynucleotides in the library is 1-5%. Further provided herein are polynucleotide adapters wherein the fraction of each of the polynucleotides in the library is 1.5-4.5%. Further provided herein are polynucleotide adapters wherein the library comprises at least 8 different unique molecular identifiers. Further provided herein are polynucleotide adapters wherein the library comprises at least 16 different unique molecular identifiers. Further provided herein are polynucleotide adapters wherein the library comprises at least 32 different unique molecular identifiers. Further provided herein are polynucleotide adapters wherein the complementary first yoke region and second yoke region are each less than 15 bases in length. Further provided herein are polynucleotide adapters wherein the complementary first yoke region and second yoke region are each less than 10 bases in length. Further provided herein are polynucleotide adapters wherein the complementary first yoke region and second yoke region are each less than 6 bases in length. Further provided herein are polynucleotide adapters wherein the first unique molecular identifier is 4-8 bases in length. Further provided herein are polynucleotide adapters wherein the second unique molecular identifier is 4-8 bases in length. Further provided herein are polynucleotide adapters wherein the first unique molecular identifier and the second unique molecular identifier are 5 or 6 bases in length. Further provided herein are polynucleotide adapters wherein the first unique molecular identifier and the second unique molecular identifier are complementary. Further provided herein are polynucleotide adapters wherein the first unique molecular identifier and the second unique molecular identifier are not complementary. Further provided herein are polynucleotide adapters wherein the first unique molecular identifier or the second unique molecular identifier comprises the sequences of one or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, TTGGC, AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC. Further provided herein are polynucleotide adapters wherein the first unique molecular identifier or the second unique molecular identifier comprises the sequences of 10 or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, TTGGC, AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC.

Further provided herein are polynucleotide adapters wherein the first unique molecular identifier or the second unique molecular identifier comprises the sequences of one or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, and TTGGC. Further provided herein are polynucleotide adapters wherein the first unique molecular identifier or the second unique molecular identifier comprises the sequences of one or more of AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC. Provided herein are polynucleotide adapters further comprising a sample nucleic acid. Further provided herein are polynucleotide adapters wherein the sample nucleic acid is DNA. Further provided herein are polynucleotide adapters wherein the sample nucleic acid is genomic DNA. Further provided herein are polynucleotide adapters wherein the genomic DNA is of human origin. Further provided herein are polynucleotide adapters wherein the first strand or the second strand further comprises at least one barcode. Further provided herein are polynucleotide adapters wherein the at least one barcode is at least 8 bases in length. Further provided herein are polynucleotide adapters wherein the at least one barcode is 8-12 bases in length. Further provided herein are polynucleotide adapters wherein the at least one barcode identifies the origin of the sample nucleic acid.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A depicts a workflow for attachment of adapters comprising unique molecular identifiers (UMIs) to a polynucleotide to form an adapter-ligated polynucleotide.

FIG. 1B depicts a workflow for amplification of adapter-ligated polynucleotides to form a library for sequencing.

FIG. 1C depicts a workflow for synthesis of a polynucleotide adapter comprising a UMI.

FIG. 1D depicts a workflow for synthesis of a polynucleotide adapter comprising a UMI, wherein the method comprises PCR extension of one strand of the adapter.

FIG. 1E depicts a workflow for synthesis of a polynucleotide adapter comprising a UMI, wherein the method comprises PCR extension of one strand of the adapter, followed by restriction enzyme cleavage.

FIG. 1F depicts a workflow for synthesis of a polynucleotide adapter comprising a UMI, wherein the method comprises restriction enzyme cleavage.

FIG. 2A depicts a workflow for duplex sequencing analysis to identify variants. “*” indicates potential errors introduced by PCR or sequencing, and “+” indicates true variants.

FIG. 2B depicts a plot of mean target coverage for all reads, all marked up reads, and duplex reads generated using adapters comprising unique molecular identifiers.

FIG. 2C depicts a plot of the number of families vs. individual samples for “cs-families” (total groups of reads when reads with the same start-stop and fragment size are collapsed; the same thing as regular mark duplicates), “ss-families” (total groups of reads when reads with the same start-stop, fragment size, and UMI are collapsed; collapsing just on UMI information), and “ds-families” (total groups of reads when reads with the same start-stop, fragment size, and UMI and from the two strands of the same molecule are collapsed; the same as ss-families, but additionally collapsed when reads are identified to come from the same duplex molecule; however, if a pair from the other strand is not found, the read is not filtered out). 10pctTruQ and 1pctTruQ refer to tiers of the Tru-Q NGS DNA Reference Standard having multiple endogenous SNPs, insertions, and deletions.

FIG. 3A depicts a plot of the number of families vs. individual samples for cs-families, ss-families, and ds-families using 5 ng of sample.

FIG. 3B depicts a plot of the number of families vs. individual samples for cs-families, ss-families, and ds-families using 10 ng of sample.

FIG. 3C depicts a plot of the number of families vs. individual samples for cs-families, ss-families, and ds-families using 50 ng of sample.

FIG. 4A depicts a plot of the number of families vs. individual samples for cs-families, ss-families, and ds-families for 16 UMI pairs.

FIG. 4B depicts a plot of the number of families vs. individual samples for cs-families, ss-families, and ds-families for 24 UMI pairs.

FIG. 5A depicts a plot of the number of families vs. individual samples for cs-families, ss-families, and ds-families for 32 UMI pairs.

FIG. 5B depicts a plot of the number of families vs. individual samples for cs-families, ss-families, and ds-families for 39 UMI pairs.

FIG. 6A depicts a plot of variant calling accuracy for four different genes using 5, 10, or 50 ng of mass input, and 16, 24, 32, or 39 UMI pairs.

FIG. 6B depicts a plot of variant calling accuracy for four different genes using 5, 10, or 50 ng of mass input, and 0, 40, 80, 120, or 160× mean target coverage.

FIG. 7A depicts a plot of variant calling accuracy for four different genes using 5, 10, or 50 ng of mass input, and 7000× or 8000× mean target coverage using all reads (no UMIs and no deduplication). Total coverage depth was 20,000×.

FIG. 7B depicts a plot of variant calling accuracy for four different genes using 5, 10, or 50 ng of mass input, and 7000× or 8000× mean target coverage using all reads (no UMIs and no deduplication). Total coverage depth was 40,000×.

FIG. 8A depicts a plot of variant calling accuracy for four different genes using 5, 10, or 50 ng of mass input, and 7000× or 8000× mean target coverage using duplex consensus reads. Total coverage depth was 20,000×.

FIG. 8B depicts a plot of variant calling accuracy for four different genes using 5, 10, or 50 ng of mass input, and 7000× or 8000× mean target coverage using duplex consensus reads. Total coverage depth was 40,000×.

FIG. 9 depicts a schematic for fragmenting a sample, end repair, A-tailing, ligating universal adapters, and adding barcodes to the adapters via PCR amplification to generate a sequencing library. Additional steps optionally include enrichment, additional rounds of amplification, and/or sequencing (not shown).

FIG. 10 depicts an image of a plate having 256 clusters, each cluster having 121 loci with polynucleotides extending therefrom.

FIG. 11A depicts a plot of polynucleotide representation (polynucleotide frequency versus abundance, as measured by absorbance) across a plate from synthesis of 29,040 unique polynucleotides from 240 clusters, each cluster having 121 polynucleotides.

FIG. 11B depicts a plot of measurement of polynucleotide frequency versus abundance (as measured by absorbance) across each individual cluster, with control clusters identified by a box.

FIG. 12 illustrates a computer system.

FIG. 13 is a block diagram illustrating an architecture of a computer system.

FIG. 14 is a diagram demonstrating a network configured to incorporate a plurality of computer systems, a plurality of cell phones and personal data assistants, and Network Attached Storage (NAS).

FIG. 15 is a block diagram of a multiprocessor computer system using a shared virtual address memory space.

FIG. 16 depicts a UMI design which promotes channel balancing.

FIG. 17 depicts a plot of observed UMIs for a library by sequence. The y-axis is labeled fraction raw observations from 0.00% to 5.50% at 0.50% intervals. The x-axis is labeled with UMI sequences.

FIG. 18 depicts a plot of UMI efficiency, measured as the number of duplex reads after duplex collapse divided by the total number of input reads. The y-axis is labeled duplex efficiency from 0-10% at 1% intervals, and the x-axis is labeled UMI blend (GBS, LGC_1, LGC_2). The left bar in each set indicates an inexperienced operator, and the right bar in each set indicates an experienced operator.

FIG. 19A depicts a plot of recall (only SBS (single base substitution) variants). The y-axis is labeled 80-100% at 5% intervals.

FIG. 19B depicts a plot of recall over SBSs (single base substitutions), pan cancer control. The y-axis is labeled recall from 0-100% at 20% intervals. The x-axis is labeled with variant allele frequencies (0.1, 0.2, 0.5, 1.0, 2.0, 5%). The left bar in each set depicts calls, and the right bar in each set depicts pileups.

FIG. 20A depicts the left half of an adapter-ligated gDNA, including i5 series barcode and UMI index.

FIG. 20B depicts the right half of an adapter-ligated gDNA, including i7 series barcode and UMI index.

DETAILED DESCRIPTION

Described herein are compositions and methods for next generation sequencing. Further provided herein are polynucleotide adapters configured for use with a polynucleotide library (e.g., from a sample). Such libraries in some instances comprise genomic DNA, RNA, or other nucleic acid. Further described herein are adapters which comprise unique molecular identifiers (UMIs). UMIs in some instances provide for unique identification of individual members of a polynucleotide library, which enables molecular counting and identification of potential errors generated during preparation of a polynucleotide library prior to sequencing.

Definitions

Throughout this disclosure, numerical features are presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of any embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range to the tenth of the unit of the lower limit unless the context clearly dictates otherwise. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual values within that range, for example, 1.1, 2, 2.3, 5, and 5.9. This applies regardless of the breadth of the range. The upper and lower limits of these intervening ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention, unless the context clearly dictates otherwise.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of any embodiment. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

Unless specifically stated or obvious from context, as used herein, the term “about” in reference to a number or range of numbers is understood to mean the stated number and numbers +/−10% thereof, or 10% below the lower listed limit and 10% above the higher listed limit for the values listed for a range.

As used herein, the terms “preselected sequence”, “predefined sequence” or “predetermined sequence” are used interchangeably. The terms mean that the sequence of the polymer is known and chosen before synthesis or assembly of the polymer. In particular, various aspects of the invention are described herein primarily with regard to the preparation of nucleic acid molecules, the sequence of the oligonucleotide or polynucleotide being known and chosen before the synthesis or assembly of the nucleic acid molecules.

The term nucleic acid encompasses double- or triple-stranded nucleic acids, as well as single-stranded molecules. In double- or triple-stranded nucleic acids, the nucleic acid strands need not be coextensive (i.e., a double-stranded nucleic acid need not be double-stranded along the entire length of both strands). Nucleic acid sequences, when provided, are listed in the 5′ to 3′ direction, unless stated otherwise. Methods described herein provide for the generation of isolated nucleic acids. Methods described herein additionally provide for the generation of isolated and purified nucleic acids. The length of polynucleotides, when provided, is described as the number of bases and abbreviated, such as nt (nucleotides), bp (base pairs), kb (kilobases), Mb (megabases) or Gb (gigabases).

Provided herein are methods and compositions for production of synthetic (i.e., de novo synthesized or chemically synthesized) polynucleotides. The terms oligonucleic acid, oligonucleotide, oligo, and polynucleotide are defined to be synonymous throughout. Libraries of synthesized polynucleotides described herein may comprise a plurality of polynucleotides collectively encoding for one or more genes or gene fragments. In some instances, the polynucleotide library comprises coding or non-coding sequences. In some instances, the polynucleotide library encodes for a plurality of cDNA sequences. Reference gene sequences from which the cDNA sequences are based may contain introns, whereas cDNA sequences exclude introns. Polynucleotides described herein may encode for genes or gene fragments from an organism. Exemplary organisms include, without limitation, prokaryotes (e.g., bacteria) and eukaryotes (e.g., mice, rabbits, humans, and non-human primates). In some instances, the polynucleotide library comprises one or more polynucleotides, each of the one or more polynucleotides encoding sequences for multiple exons. Each polynucleotide within a library described herein may encode a different sequence, i.e., non-identical sequence. In some instances, each polynucleotide within a library described herein comprises at least one portion that is complementary to a sequence of another polynucleotide within the library. Polynucleotide sequences described herein may, unless stated otherwise, comprise DNA or RNA. A polynucleotide library described herein may comprise at least 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 30,000, 50,000, 100,000, 200,000, 500,000, 1,000,000, or more than 1,000,000 polynucleotides. A polynucleotide library described herein may have no more than 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 30,000, 50,000, 100,000, 200,000, 500,000, or no more than 1,000,000 polynucleotides. A polynucleotide library described herein may comprise 10 to 500, 20 to 1000, 50 to 2000, 100 to 5000, 500 to 10,000, 1,000 to 5,000, 10,000 to 50,000, 100,000 to 500,000, or 50,000 to 1,000,000 polynucleotides. A polynucleotide library described herein may comprise about 370,000; 400,000; 500,000 or more different polynucleotides.

Unique Molecular Identifiers

Described herein are adapters comprising unique molecular identifiers (UMIs). Adapters in some instances comprise a structure 100 of FIG. 1A. In some instances, adapters comprise universal adapters. In some instances, adapters comprise a Y-annealing region (anneals to form yoke), one or more Y-step non-annealing regions, a first index region 101 a, a second index region 101 b, a first UMI (index) region 102 a, a second UMI (index) region 102 b, and one or more regions exterior to the index. In some instances, adapters 100 are ligated 104 to sample polynucleotides 103 to form an adapter-ligated polynucleotide 105. After denaturation 106 of 105 (FIG. 1A), top 107 a and bottom 107 b strand ligation products are formed. In some instances, each strand is labeled with a different UMI. After amplification 109 with forward 108 a and backward 108 b primers, top strand 110 a and bottom strand 110 b PCR products are generated. In some instances, adapter-ligated polynucleotides generated with universal adapters are further amplified with barcoded primers. In some instances, adapters described herein comprise “in-line” UMIs, wherein at least one of a 5′ or 3′ UMI is not complementary to the other corresponding strand of the adapter (101 a and 101 b are not complementary). In some instances, adapters described herein comprise “duplex” UMIs, wherein at least one of a 5′ or 3′ UMI is complementary to the other corresponding strand of the adapter (101 a and 101 b are complementary).
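The distinction between “in-line” and “duplex” UMIs can be illustrated with a minimal sketch that tests whether two UMI sequences are complementary. The sketch assumes that both sequences are written in the 5′ to 3′ direction and that “complementary” means one sequence is the reverse complement of the other; the function names are hypothetical and are not part of the disclosure.

```python
# Illustrative sketch only; helper names are hypothetical.
# Assumption: both UMIs are written 5' to 3', so complementary UMIs are
# reverse complements of one another (antiparallel base pairing).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def classify_umi_pair(first_umi: str, second_umi: str) -> str:
    """Label a UMI pair 'duplex' if complementary, otherwise 'in-line'."""
    if second_umi.upper() == reverse_complement(first_umi):
        return "duplex"
    return "in-line"


print(classify_umi_pair("AAGGA", reverse_complement("AAGGA")))
print(classify_umi_pair("AAGGA", "ACAAC"))
```

The two printed labels correspond to a complementary pair and a non-complementary pair, respectively.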

Adapter-ligated libraries comprising unique molecular identifiers may be used to distinguish between “true” mutations from a polynucleotide sample library and artifacts generated during sequencing library preparation (e.g., PCR errors, sequencing errors, or other erroneous base call). In some instances, a workflow as shown in FIG. 2A is used to analyze a library of adapter-ligated sample polynucleotides 201. Adapter-ligated sample polynucleotides 201 each comprise two distinct UMIs 201 b represented by letters (A-F; six combinations of barcodes are shown for simplicity), and are attached to a sample polynucleotide 201 c. After sequencing 206, forward and reverse read pairs 202 from sequencing are sorted into read pair groups 202 a. Potential PCR-based errors are designated with “*”, and true polymorphisms are designated as “+”. Next, read pairs 203 are grouped 207 by barcode and barcode position. Single-stranded consensus sequences 204 are then generated 208 from each group of barcode-grouped read pairs. Errors from D-C and F-E are identified, although the error in A-B remains. Finally, duplex consensus sequences 205 are generated 209 by comparing each set of single-stranded consensus sequences. The error in A-B can be identified, and true mutation E-F can be confirmed. In some instances, errors include substitutions, deletions, or insertions. In some instances, an error is present in the sample polynucleotide portion of an adapter-ligated polynucleotide. In some instances, an error is present in a barcode configured to identify a sample origin (e.g., index) or to uniquely identify a sample polynucleotide. In some instances, an error is present in a UMI. In some instances, an error is present in a sample index. Compositions and methods described herein in some instances are used to identify such errors.
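The grouping and consensus steps described above may be sketched as follows. This is a simplified illustration rather than the disclosed implementation: it assumes reads are already aligned to the same coordinates and orientation and have equal length, that a read from one strand carries UMIs (A, B) while its partner strand carries (B, A), that a strict majority vote forms each single-stranded consensus, and that disagreeing positions are masked with “N”. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def single_strand_consensus(seqs):
    """Per-position strict-majority vote; positions without a majority become 'N'."""
    consensus = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count * 2 > len(column) else "N")
    return "".join(consensus)


def duplex_consensus(reads):
    """reads: iterable of (first_umi, second_umi, sequence) tuples.

    Returns a dict mapping (umi_a, umi_b), with umi_a < umi_b, to a duplex
    consensus sequence. Families without a partner strand are skipped.
    """
    groups = defaultdict(list)
    for first_umi, second_umi, seq in reads:
        groups[(first_umi, second_umi)].append(seq)

    # One single-stranded consensus per UMI pair and UMI position.
    sscs = {key: single_strand_consensus(seqs) for key, seqs in groups.items()}

    duplex = {}
    for (a, b), top in sscs.items():
        if a >= b:  # handle each strand pair once; identical UMIs are ambiguous
            continue
        bottom = sscs.get((b, a))
        if bottom is None:
            continue
        # Keep a base only where both strands agree.
        duplex[(a, b)] = "".join(x if x == y else "N" for x, y in zip(top, bottom))
    return duplex


reads = [
    ("A", "B", "ACGT"), ("A", "B", "ACGT"), ("A", "B", "ACTT"),  # one read-level error
    ("B", "A", "ACGT"), ("B", "A", "ACGT"),
    ("C", "D", "ACGA"), ("C", "D", "ACGA"),  # change seen on one strand only
    ("D", "C", "ACGT"),
    ("E", "F", "AGGT"), ("E", "F", "AGGT"),  # change seen on both strands
    ("F", "E", "AGGT"),
]
print(duplex_consensus(reads))
```

In this toy input, a base change supported by only one strand is masked in the duplex consensus, while a change supported by both strands is retained.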

Described herein are sets of UMIs, wherein the set has defined properties. In some instances, a UMI set comprises a plurality of different polynucleotides having unique sequences. In some instances, a UMI set is 8, 12, 16, 20, 24, 30, 32, 36, 39, 48, or 64 unique sequences. In some instances, the sequences of a UMI set differ by a Hamming distance of no more than 1, 2, 3, 4, or 5. In some instances, the sequences of a UMI set differ by a Hamming distance of at least 1, 2, 3, 4, or 5. In some instances, the sequences of a UMI set differ by a Hamming distance of at least 2. In some instances, the sequences of a UMI set differ by a Hamming distance of at least 1.
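A minimum pairwise Hamming distance for a candidate UMI set may be checked with a short script such as the one below. This is a sketch under stated assumptions: Hamming distance is defined only for sequences of equal length, so UMIs are compared only within groups of the same length, and the function names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def min_hamming_distance(umis):
    """Smallest pairwise Hamming distance among same-length UMIs.

    Returns None when no two UMIs share a length.
    """
    by_length = {}
    for umi in umis:
        by_length.setdefault(len(umi), []).append(umi)
    distances = [
        hamming(a, b)
        for group in by_length.values()
        for a, b in combinations(group, 2)
    ]
    return min(distances) if distances else None


# Example with a few of the 5-base sequences listed in this disclosure.
print(min_hamming_distance(["AAGGA", "ACAAC", "ATACG", "CACTG"]))
```

A set would satisfy a “Hamming distance of at least 2” criterion when the returned value is 2 or greater.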

UMIs may be any length, depending on the desired application. In some instances, a UMI is no more than 15, 12, 10, 8, 7, 6, 5, 4, or no more than 3 bases in length. In some instances, a UMI is about 15, 12, 10, 8, 7, 6, 5, 4, or about 3 bases in length. In some instances, a UMI is about 3-12, 3-10, 3-8, 4-12, 4-10, 4-8, 6-12, or 8-12 bases in length. UMIs in a set may comprise more than one length. In some instances, 10, 20, 25, 30, 40, 50, 60, or 70 percent of UMIs in the set are a first length, and 90, 80, 75, 70, 60, 50, 40, or 30 percent are a second length. In some instances, the first length is 3-5 bases, and the second length is 3-5 bases. In some instances, UMIs comprise lengths of 5 or 6 bases.

After addition of UMI-containing adapters to sample polynucleotides, at least some of the sample polynucleotides may be uniquely labeled. In some instances, at least 30%, 50%, 75%, 80%, 90%, 95%, or at least 98% of the sample polynucleotides are ligated to adapters comprising UMIs. In some instances, at least 1%, 2%, 5%, 10%, 15%, 20%, 30%, 50%, 75%, 80%, 90%, 95%, or at least 98% of the sample polynucleotides are labeled with a unique UMI sequence. In some instances, no more than 1%, 2%, 5%, 10%, 15%, 20%, 30%, 50%, 75%, 80%, 90%, 95%, or no more than 98% of the sample polynucleotides are labeled with a unique UMI sequence. In some instances, at least 1%, 2%, 5%, 10%, 15%, 20%, 30%, 50%, 75%, 80%, 90%, 95%, or at least 98% of the sample polynucleotides are uniquely identifiable after labeling with a UMI.

UMIs described herein in some instances comprise sequences of one or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, TTGGC, AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC. UMIs described herein in some instances comprise sequences of two or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, TTGGC, AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC. UMIs described herein in some instances comprise sequences of five or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, TTGGC, AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC. UMIs described herein in some instances comprise sequences of ten or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, TTGGC, AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC.

UMIs may be represented at pre-selected percentages among a library of UMIs. In some instances, at least 90% of the UMIs are present at a fraction of 1-5%. In some instances, at least 90% of the UMIs are present at a fraction of 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 7%, or 8%. In some instances, at least 90% of the UMIs are present at a fraction of 0.5-8%, 1-7%, 1.5-7%, 2-7%, 2.5-6%, 3-8%, 3-6%, 1-5%, 0.5-5.5%, 1-4%, 1-6%, or 1-8%.
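Whether a library meets a representation criterion of this kind may be estimated from raw UMI observations, for example as sketched below. The sketch assumes that the “fraction” of a UMI is its share of all raw UMI observations; the function names and default bounds are illustrative only.

```python
from collections import Counter


def umi_fractions(observed_umis):
    """Map each observed UMI to its share of all raw observations."""
    counts = Counter(observed_umis)
    total = sum(counts.values())
    return {umi: n / total for umi, n in counts.items()}


def share_within_range(fractions, low=0.01, high=0.05):
    """Proportion of distinct UMIs whose fraction lies in [low, high]."""
    if not fractions:
        return 0.0
    in_range = sum(low <= f <= high for f in fractions.values())
    return in_range / len(fractions)


# Toy example: two UMIs observed 3 times and 1 time.
fractions = umi_fractions(["AAGGA", "AAGGA", "AAGGA", "ACAAC"])
print(fractions)
print(share_within_range(fractions, low=0.2, high=0.5))
```

Under the stated assumption, a criterion such as “at least 90% of the UMIs are present at a fraction of 1-5%” corresponds to `share_within_range(fractions, 0.01, 0.05) >= 0.9`.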

Any amount of sample polynucleotides (e.g., input DNA or other nucleic acid) may be ligated to adapters described herein. In some instances, the amount of sample polynucleotides is about 1, 5, 8, 10, 15, 20, 25, 30, 50, 75, or about 100 ng. In some instances, the amount of sample polynucleotides is no more than 1, 5, 8, 10, 15, 20, 25, 30, 50, 75, or no more than 100 ng. In some instances, the amount of sample polynucleotides is at least 1, 5, 8, 10, 15, 20, 25, 30, 50, 75, or at least 100 ng. In some instances, the amount of sample polynucleotides is 1-10 ng, 1-100 ng, 3-10 ng, 5-100 ng, 5-75 ng, 5-50 ng, 10-100 ng, 10-50 ng, 25-100 ng, or 25-75 ng.

Provided herein are methods of generating adapters comprising UMIs. A first method of adapter synthesis comprises synthesis of a top strand of an adapter comprising at least one UMI and a complementary bottom strand. After annealing the top and bottom adapter strands, an adapter comprising the structure of adapter 100 is formed (FIG. 1C). In a second method of adapter synthesis, a top strand is synthesized without a UMI, and a bottom strand is synthesized comprising a complementary region and a UMI (FIG. 1D). After annealing, PCR is used to generate a complementary UMI on the top strand, and a terminal transferase adds a T to the 3′ end of the top strand to generate adapter 100. In a third method of synthesis, a top strand which does not comprise a UMI, and a bottom strand comprising a UMI, a restriction site, and a 5′ overhang are synthesized (FIG. 1E). After annealing, the top strand is extended with PCR, and a restriction endonuclease is used to cleave a portion of the 3′ top strand and 5′ bottom strand to generate adapter 100. In a fourth method of adapter synthesis, two complementary strands each comprising a UMI, a restriction site, and an overhang portion (3′ top strand, 5′ bottom strand) are synthesized, annealed, and cleaved with a restriction enzyme to generate adapter 100. More than one UMI may be present per adapter. In some instances, an adapter comprises 1, 2, 3, 4, 5, or more UMIs. In some instances, adapters comprise a first UMI and a second UMI. In some instances, a first UMI and a second UMI are complementary. In some instances, a first UMI and a second UMI are not complementary. In some instances, adapters are combined into libraries of adapters. In some instances, adapters in a library comprise UMIs. In some instances, adapters in a library comprise unique combinations of a first UMI and a second UMI.

Methylome Analysis

Analysis of the methylome may provide important information on biological processes for a given genomic sample. In some instances, methylated bases in a genomic sample are identified by either (a) conversion of a methylated base to a different base, or (b) conversion of a non-methylated base to a different base. Such conversions in some instances are performed on whole genomes or genomic fragments. The resulting sequences are then compared to a reference sequence (obtained without conversion/treatment) to identify which bases are methylated. In some instances, a conversion method (or process) comprises treatment with a deamination reagent. In some instances, a conversion method comprises treatment with bisulfite. In some instances, a conversion method comprises treatment with a reagent to protect methylcytosines (e.g., TET2 for oxidation), followed by treatment with an enzyme to deaminate unprotected cytosines (e.g., APOBEC). Additional reagents which differentiate methylated and non-methylated bases are also consistent with the methods disclosed herein. In some instances, unmethylated cytosines are converted to uracil. In some instances, PCR amplification of these uracil-containing modified genomes results in conversion of uracil to thymine. In some instances, adapters described herein are modified to replace cytosines with methylcytosines or another base which resists conversion.
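The comparison of a converted sequence to an untreated reference may be illustrated with a minimal in silico sketch of option (b), in which unmethylated cytosines are converted to uracil and read as thymine after PCR. The sketch assumes complete conversion, considers only the strand shown, and ignores sequencing errors; the function names are hypothetical.

```python
def convert_read(seq, methylated_positions=frozenset()):
    """Simulate conversion: unmethylated C is read as T; methylated C is retained."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq.upper())
    )


def call_methylated(reference, converted):
    """Positions where the reference C is still read as C after conversion."""
    return [
        i
        for i, (ref_base, conv_base) in enumerate(zip(reference.upper(), converted.upper()))
        if ref_base == "C" and conv_base == "C"
    ]


reference = "ACGTCC"
converted = convert_read(reference, methylated_positions={1})
print(converted)
print(call_methylated(reference, converted))
```

Positions reported by `call_methylated` are cytosines that resisted conversion, which under these assumptions are interpreted as methylated.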

Universal Adapters

Provided herein are universal adapters. In some instances, universal adapters comprise one or more unique molecular identifiers. In some instances, the universal adapters disclosed herein may comprise a universal polynucleotide adapter comprising a first strand and a second strand. In some instances, a first strand comprises a first primer binding region, a first non-complementary region, and a first yoke region. In some instances, a second strand comprises a second primer binding region, a second non-complementary region, and a second yoke region. In some instances, a primer binding region allows for PCR amplification of a polynucleotide adapter. In some instances, a primer binding region allows for PCR amplification of a polynucleotide adapter and concurrent addition of one or more barcodes to the polynucleotide adapter. In some instances, the first yoke region is complementary to the second yoke region. In some instances, the first non-complementary region is not complementary to the second non-complementary region. In some instances, the universal adapter is a Y-shaped or forked adapter. In some instances, one or more yoke regions comprise nucleobase analogues that raise the T_(m) between a first yoke region and a second yoke region. Primer binding regions as described herein may be in the form of a terminal adapter region of a polynucleotide. In some instances, a universal adapter comprises one index sequence. In some instances, a universal adapter comprises one unique molecular identifier. In some instances, universal adapters are configured for use with barcoded primers, wherein after ligation, barcoded primers are added via PCR.

A universal (polynucleotide) adapter may be shortened relative to a typical barcoded adapter (e.g., full-length “Y adapter”). For example, a universal adapter strand is 20-45 bases in length. In some instances, a universal adapter strand is 25-40 bases in length. In some instances, a universal adapter strand is 30-35 bases in length. In some instances, a universal adapter strand is no more than 50 bases in length, no more than 45 bases in length, no more than 40 bases in length, no more than 35 bases in length, no more than 30 bases in length, or no more than 25 bases in length. In some instances, a universal adapter strand is about 25, 27, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, or about 60 bases in length. In some instances, a universal adapter strand is about 60 base pairs in length. In some instances, a universal adapter strand is about 58 base pairs in length. In some instances, a universal adapter strand is about 52 base pairs in length. In some instances, a universal adapter strand is about 33 base pairs in length.

A universal adapter may be modified to facilitate ligation with a sample polynucleotide. For example, the 5′ terminus is phosphorylated. In some instances, a universal adapter comprises one or more non-native nucleobase linkages such as a phosphorothioate linkage. For example, a universal adapter comprises a phosphorothioate between the 3′ terminal base and the base adjacent to the 3′ terminal base. A sample polynucleotide in some instances comprises nucleic acid from a variety of sources, such as DNA or RNA of human, bacterial, plant, animal, fungal, or viral origin. An adapter-ligated sample polynucleotide in some instances comprises a sample polynucleotide (e.g., sample nucleic acid) with universal adapters ligated to both the 5′ and 3′ ends of the sample polynucleotide to form an adapter-ligated polynucleotide. A duplex sample polynucleotide comprises both a first strand (forward) and a second strand (reverse).

Universal adapters may contain any number of different nucleobases (DNA, RNA, etc.), nucleobase analogues, or non-nucleobase linkers or spacers. For example, an adapter comprises one or more nucleobase analogues or other groups that enhance hybridization (T_(m)) between two strands of the adapter. In some instances, nucleobase analogues are present in the yoke region of an adapter. Nucleobase analogues and other groups include but are not limited to locked nucleic acids (LNAs), bicyclic nucleic acids (BNAs), C5-modified pyrimidine bases, 2′-O-methyl substituted RNA, peptide nucleic acids (PNAs), glycol nucleic acids (GNAs), threose nucleic acids (TNAs), xenonucleic acids (XNAs), morpholino backbone-modified bases, minor groove binders (MGBs), spermine, G-clamps, or anthraquinone (Uaq) caps. In some instances, adapters comprise one or more nucleobase analogues selected from Table 1.

TABLE 1

Base: A, T, G, C, U
Locked Nucleic Acid (LNA): [chemical structures not reproduced]
Bridged Nucleic Acid* (BNA): [chemical structures not reproduced]

*R is H or Me.

Universal adapters may comprise any number of nucleobase analogues (such as LNAs or BNAs), depending on the desired hybridization T_(m). For example, an adapter comprises 1 to 20 nucleobase analogues. In some instances, an adapter comprises 1 to 8 nucleobase analogues. In some instances, an adapter comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, or at least 12 nucleobase analogues. In some instances, an adapter comprises about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or about 16 nucleobase analogues. In some instances, the number of nucleobase analogues is expressed as a percent of the total bases in the adapter. For example, an adapter comprises at least 1%, 2%, 5%, 10%, 12%, 18%, 24%, 30%, or more than 30% nucleobase analogues. In some instances, adapters (e.g., universal adapters) described herein comprise methylated nucleobases, such as methylated cytosine.

Barcodes

Polynucleotide primers may comprise defined sequences, such as barcodes (or indices). Adapters in some instances comprise one or more barcodes. In some instances, an adapter comprises at least one indexing barcode and at least one unique molecular identifier barcode. Barcodes can be attached to universal adapters, for example, using PCR and barcoded primers to generate barcoded adapter-ligated sample polynucleotides. Primer binding sites, such as universal primer binding sites, facilitate simultaneous amplification of all members of a barcode primer library, or a subpopulation of members. In some instances, a primer binding site comprises a region that binds to a flow cell or other solid support during next generation sequencing. In some instances, a barcoded primer comprises a P5 (5′-AATGATACGGCGACCACCGA-3′) or P7 (5′-CAAGCAGAAGACGGCATACGAGAT-3′) sequence. In some instances, primer binding sites are configured to bind to universal adapter sequences, and facilitate amplification and generation of barcoded adapters. In some instances, barcoded primers are no more than 60 bases in length. In some instances, barcoded primers are no more than 55 bases in length. In some instances, barcoded primers are 50-60 bases in length. In some instances, barcoded primers are about 60 bases in length. In some instances, barcodes described herein comprise methylated nucleobases, such as methylated cytosine.

The number of unique barcodes available for a barcode set (a collection of unique barcodes or barcode combinations configured to be used together to uniquely define samples) may depend on the barcode length. In some instances, a Hamming distance is defined by the number of base differences between any two barcodes. In some instances, a Levenshtein distance is defined by the number of changes needed to change one barcode into another (insertions, substitutions, or deletions). In some instances, barcode sets described herein comprise a Levenshtein distance of at least 2, 3, 4, 5, 6, 7, or at least 8. In some instances, barcode sets described herein comprise a Hamming distance of at least 2, 3, 4, 5, 6, 7, or at least 8.
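The two distance definitions can be sketched in code as follows. This is a minimal illustration rather than a barcode design tool; the function names and the example barcodes are hypothetical, and the minimum pairwise distance is used here as one plausible reading of the distance "of a barcode set".

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length barcodes")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion from a
                curr[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),   # substitution (free if bases match)
            ))
        prev = curr
    return prev[-1]


def min_pairwise(barcodes, dist):
    """Smallest distance between any two barcodes in a set."""
    return min(dist(a, b) for a, b in combinations(barcodes, 2))
```

For example, the hypothetical barcodes ACGT and CGTA differ at all four positions (Hamming distance 4) but are separated by only one deletion and one insertion (Levenshtein distance 2), which is why a set may be specified by either measure.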

Barcodes may be incorrectly associated with a different sample than they were assigned. In some instances, incorrect barcodes arise from PCR errors (e.g., substitution) during library amplification. In some instances, entire barcodes “hop” or are transferred from one sample polynucleotide to another. Such transfers in some instances result from cross-contamination of free adapters or primers during a library generation workflow. In some instances, a group of barcodes (barcode set) is chosen to minimize “barcode hopping”. In some instances, barcode hopping (for a single barcode) for a barcode set described herein is no more than 7%, 5%, 4%, 3%, 2%, 1%, 0.5%, or no more than 0.1%. In some instances, barcode hopping (for a single barcode) for a barcode set described herein is 0.1-6%, 0.1-5%, 0.2-5%, 0.5-5%, 1-7%, 1-5%, or 0.5-7%. In some instances, barcode hopping (for two barcodes) for a barcode set described herein is no more than 0.7%, 0.5%, 0.4%, 0.3%, 0.2%, 0.1%, 0.05%, or no more than 0.1%. In some instances, barcode hopping (for two barcodes) for a barcode set described herein is 0.01-0.6%, 0.01-0.5%, 0.02-0.5%, 0.05-0.5%, 0.1-0.7%, 0.1-0.5%, or 0.05-0.7%.

Barcoded primers comprise one or more barcodes. In some instances, the barcodes are added to universal adapters through a PCR reaction. Barcodes are nucleic acid sequences that allow some feature of a polynucleotide with which the barcode is associated to be identified. In some instances, a barcode comprises an index sequence. In some instances, index sequences allow for identification of a sample, or unique source of nucleic acids to be sequenced. A barcode or combination of barcodes in some instances identifies a specific patient. A barcode or combination of barcodes in some instances identifies a specific sample from a patient among other samples from the same patient. After sequencing, the barcode (or barcode region) provides an indicator for identifying a characteristic associated with the coding region or sample source. Barcodes can be designed at suitable lengths to allow a sufficient degree of identification, e.g., at least about 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, or more bases in length. Multiple barcodes, such as about 2, 3, 4, 5, 6, 7, 8, 9, 10, or more barcodes, may be used on the same molecule, optionally separated by non-barcode sequences. In some instances, a barcode is positioned on the 5′ and the 3′ sides of a sample polynucleotide. In some instances, each barcode in a plurality of barcodes differs from every other barcode in the plurality in at least three base positions, such as at least about 3, 4, 5, 6, 7, 8, 9, 10, or more positions. Use of barcodes allows for the pooling and simultaneous processing of multiple libraries for downstream applications, such as sequencing (multiplex). In some instances, at least 4, 8, 16, 32, 48, 64, 128, or more than 512 barcoded libraries are used.
In some instances, at least 400, 500, 800, 1000, 2000, 5000, 10,000, 12,000, 15,000, 18,000, 20,000, or at least 25,000 barcodes are used. Barcoded primers or adapters may comprise unique molecular identifiers (UMIs). Such UMIs in some instances uniquely tag all nucleic acids in a sample. In some instances, at least 60%, 70%, 80%, 90%, 95%, or more than 95% of the nucleic acids in a sample are tagged with a UMI. In some instances, at least 85%, 90%, 95%, 97%, or at least 99% of the nucleic acids in a sample are tagged with a unique barcode, or UMI. Barcoded primers in some instances comprise an index sequence and one or more UMIs. UMIs allow for internal measurement of initial sample concentrations or stoichiometry prior to downstream sample processing (e.g., PCR or enrichment steps) which can introduce bias. In some instances, UMIs comprise one or more barcode sequences. In some instances, each strand (forward vs. reverse) of an adapter-ligated sample polynucleotide possesses one or more unique barcodes. Such barcodes are optionally used to uniquely tag each strand of a sample polynucleotide. In some instances, a barcoded primer comprises an index barcode and a UMI barcode. In some instances, after amplification with at least two barcoded primers, the resulting amplicons comprise two index sequences and two UMIs. In some instances, after amplification with at least two barcoded primers, the resulting amplicons comprise two index barcodes and one UMI barcode. In some instances, each strand of a universal adapter-sample polynucleotide duplex is tagged with a unique barcode, such as a UMI or index barcode.
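One way reads tagged with a first UMI and a second UMI might be organized after sequencing is sketched below. This is an illustrative assumption, not the disclosed analysis pipeline: the read tuples, function names, and simple per-position majority vote are hypothetical, and the sketch assumes all reads in a family have the same length and alignment.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group read sequences into families keyed by their (first UMI, second UMI) pair."""
    families = defaultdict(list)
    for umi1, umi2, seq in reads:
        families[(umi1, umi2)].append(seq)
    return dict(families)


def consensus(seqs):
    """Per-position majority base across one UMI family (equal-length reads assumed)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


# Hypothetical reads: (first UMI, second UMI, insert sequence)
reads = [
    ("AAC", "GGT", "ACGT"),
    ("AAC", "GGT", "ACGT"),
    ("AAC", "GGT", "ACGA"),   # differs from its family: likely an amplification error
    ("TTG", "CCA", "ACGT"),
]
families = group_by_umi(reads)
molecule_count = len(families)                    # estimate of distinct starting molecules
family_consensus = consensus(families[("AAC", "GGT")])
```

A base change seen in only a minority of reads within one family is treated here as an amplification or sequencing error, whereas a change supported across a whole family (and across independent families) is more consistent with a true variant. Counting families rather than reads gives an estimate of the number of starting molecules before PCR.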

Barcoded primers in a library comprise a region that is complementary to a primer binding region on a universal adapter. For example, a universal adapter binding region is complementary to a primer region of the universal adapter, and a universal adapter binding region is complementary to a primer region of the universal adapter. Such arrangements facilitate extension of universal adapters during PCR, and attach barcoded primers. In some instances, the T_(m) between the primer and the primer binding region is 40-65 degrees C. In some instances, the T_(m) between the primer and the primer binding region is 42-63 degrees C. In some instances, the T_(m) between the primer and the primer binding region is 50-60 degrees C. In some instances, the T_(m) between the primer and the primer binding region is 53-62 degrees C. In some instances, the T_(m) between the primer and the primer binding region is 54-58 degrees C. In some instances, the T_(m) between the primer and the primer binding region is 40-57 degrees C. In some instances, the T_(m) between the primer and the primer binding region is 40-50 degrees C. In some instances, the T_(m) between the primer and the primer binding region is about 40, 45, 47, 50, 52, 53, 55, 57, 59, 61, or 62 degrees C.

Hybridization Blockers

Blockers may contain any number of different nucleobases (DNA, RNA, etc.), nucleobase analogues (non-canonical), or non-nucleobase linkers or spacers. In some instances, blockers comprise universal blockers. Such blockers in some instances are described as a “set”, wherein the set comprises two or more blockers configured to prevent unwanted interactions with the same adapter sequence. In some instances, universal blockers prevent adapter-adapter interactions independent of one or more barcodes present on at least one of the adapters. For example, a blocker comprises one or more nucleobase analogues or other groups that enhance hybridization (T_(m)) between the blocker and the adapter. In some instances, a blocker comprises one or more nucleobases which decrease hybridization (T_(m)) between the blocker and the adapter (e.g., “universal” bases). In some instances, a blocker described herein comprises both one or more nucleobases which increase hybridization (T_(m)) between the blocker and the adapter and one or more nucleobases which decrease hybridization (T_(m)) between the blocker and the adapter.

Described herein are hybridization blockers comprising one or more regions which enhance binding to targeted sequences (e.g., adapter), and one or more regions which decrease binding to target sequences (e.g., adapter). In some instances, each region is tuned for a given desired level of off-bait activity during target enrichment applications. In some instances, each region can be altered with either a single type of chemical modification/moiety or multiple types to increase or decrease overall affinity of a molecule for a targeted sequence. In some instances, the melting temperatures of all individual members of a blocker set are held above a specified temperature (e.g., with the addition of moieties such as LNAs and/or BNAs). In some instances, a given set of blockers will improve off-bait performance independent of index length, independent of index sequence, and independent of how many adapter indices are present in hybridization.

Blockers may comprise moieties which increase and/or decrease affinity for a target sequence, such as an adapter. In some instances, such specific regions can be thermodynamically tuned to specific melting temperatures to either avoid or increase the affinity for a particular targeted sequence. This combination of modifications is in some instances designed to help increase the affinity of the blocker molecule for specific and unique adapter sequence and decrease the affinity of the blocker molecule for repeated adapter sequence (e.g., Y-stem annealing portion of adapter). In some instances, blockers comprise moieties which decrease binding of a blocker to the Y-stem region of an adapter. In some instances, blockers comprise moieties which decrease binding of a blocker to the Y-stem region of an adapter, and moieties which increase binding of a blocker to non-Y-stem regions of an adapter.

Blockers (e.g., universal blockers) and adapters may form a number of different populations during hybridization. A population ‘A’ in some instances comprises blockers correctly bound to non-index regions of the adapters. In a population ‘B’, a region of the blockers is bound to the “yoke” region of the adapter, but a remaining portion of the blocker does not bind to an adjacent region of the adapter. In a population ‘C’, two blockers unproductively dimerize. In a population ‘D’, blockers are unbound to any other nucleic acids. In some instances, when the number of DNA modifications that decrease affinity in the Y-stem annealing region of the blocker is increased, the populations ‘A’ and ‘D’ dominate and have either the desired effect or a minimal effect. In some instances, as the number of DNA modifications that decrease affinity in the Y-stem annealing region of the blocker is decreased, the populations ‘B’ and ‘C’ dominate and have undesired effects, where daisy-chaining or annealing to other adapters can occur (‘B’) or blockers are sequestered so that they are unable to function properly (‘C’).

The index on either single or dual index adapter designs may be either partially or fully covered by universal blockers that have been extended with specifically designed DNA modifications to cover adapter index bases. In some instances, such modifications comprise moieties which decrease annealing to the index, such as universal bases. In some instances, the index of a dual index adapter is partially covered (or is overlapped) by one or more blockers. In some instances, the index of a dual index adapter is fully covered by one or more blockers. In some instances, the index of a single index adapter is partially covered by one or more blockers. In some instances, the index of a single index adapter is fully covered by one or more blockers. In some instances, a blocker overlaps an index sequence by at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20 or more than 20 bases. In some instances, a blocker overlaps an index sequence by no more than 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, or no more than 25 bases. In some instances, a blocker overlaps an index sequence by about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20 or about 30 bases. In some instances, a blocker overlaps an index sequence by 1-5, 1-3, 2-5, 2-8, 2-10, 3-6, 3-10, 4-10, 4-15, 1-4 or 5-7 bases. In some instances, a region of a blocker which overlaps an index sequence comprises at least one 2′-deoxyinosine or 5-nitroindole nucleobase.

One or two blockers may overlap with an index sequence present on an adapter. In some instances, one or two blockers combined overlap with at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20 or more than 20 bases of the index sequence. In some instances, one or two blockers combined overlap with no more than 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20 or no more than 20 bases of the index sequence. In some instances, one or two blockers combined overlap with about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20 or about 20 bases of the index sequence. In some instances, one or two blockers combined overlap by 1-5, 1-3, 2-5, 2-8, 2-10, 3-6, 3-10, 4-10, 4-15, 1-4 or 5-7 bases of the index sequence. In some instances, a region of a blocker which overlaps an index sequence comprises at least one 2′-deoxyinosine or 5-nitroindole nucleobase.

In a first arrangement, the length of the adapter index overhang may be varied. When designed from a single side, the adapter index overhang can be altered to cover from 0 to n of the adapter index bases from either side of the index. This allows for the ability to design such adapter blockers for both single and dual index adapter systems.

In a second arrangement, the adapter index bases are covered from both sides. When adapter index bases are covered from both sides, the length of the covering region of each blocker can be chosen such that a single pair of blockers is capable of interacting with a range of adapter index lengths while still covering a significant portion of the total number of index bases. As an example, take two blockers that have been designed with 3 bp overhangs that cover the adapter index. In the context of 6 bp, 8 bp, or 10 bp adapter index lengths, these blockers will leave 0 bp, 2 bp, or 4 bp exposed during hybridization, respectively.
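The arithmetic in this example can be written out as a short sketch. The function name and parameters are hypothetical; the calculation assumes two blockers whose overhangs cover the index from opposite sides without overlapping each other beyond the index length.

```python
def exposed_index_bases(index_length, overhang_per_blocker, blockers=2):
    """Index bases left uncovered when blockers cover the index from both sides.

    Coverage cannot exceed the index length, so the result is floored at zero.
    """
    covered = blockers * overhang_per_blocker
    return max(0, index_length - covered)


# Two blockers with 3 bp overhangs against 6, 8, and 10 bp indices
exposed = [exposed_index_bases(n, 3) for n in (6, 8, 10)]
```

With 3 bp overhangs this gives 0, 2, and 4 exposed bases for 6 bp, 8 bp, and 10 bp indices, matching the values stated in the text.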

In a third arrangement, modified nucleobases are selected to cover index adapter bases. Examples of these modifications that are currently commercially available include degenerate bases (i.e., mixed bases of A, T, C, G), 2′-deoxyinosine, and 5-nitroindole.

In a fourth arrangement, blockers with adapter index overhangs bind to either the sense (i.e., ‘top’) or anti-sense (i.e., ‘bottom’) strand of a next generation sequencing library.

In a fifth arrangement, blockers are further extended to cover other polynucleotide sequences (e.g., a poly-A tail added in a previous biochemical step in order to facilitate ligation or other method to introduce a defined adapter sequence, a unique molecular identifier for bioinformatic assignment following sequencing, etc.) in addition to the standard adapter index bases of defined length and composition. These types of sequences can be placed in multiple locations of an adapter, and in this case the most widely utilized case (i.e., unique molecular identifier next to the genomic insert) is presented. Other positions for the unique molecular identifier (e.g., next to adapter index bases) could also be addressed with similar approaches.

In a sixth arrangement, all of the previous arrangements are utilized in various combinations to meet a targeted performance metric for off-bait performance during target enrichment under specified conditions.

Blockers may comprise moieties, such as nucleobase analogues. Nucleobase analogues and other groups include but are not limited to locked nucleic acids (LNAs), bicyclic nucleic acids (BNAs), C5-modified pyrimidine bases, 2′-O-methyl substituted RNA, peptide nucleic acids (PNAs), glycol nucleic acids (GNAs), threose nucleic acids (TNAs), inosine, 2′-deoxyinosine, 3-nitropyrrole, 5-nitroindole, xenonucleic acids (XNAs), morpholino backbone-modified bases, minor groove binders (MGBs), spermine, G-clamps, or anthraquinone (Uaq) caps. In some instances, nucleobase analogues comprise universal bases, wherein the nucleobase has a lower T_(m) for binding to a cognate nucleobase. In some instances, universal bases comprise 5-nitroindole or 2′-deoxyinosine. In some instances, blockers comprise spacer elements that connect two polynucleotide chains. In some instances, blockers comprise one or more nucleobase analogues selected from Table 1. In some instances, such nucleobase analogues are added to control the T_(m) of a blocker. Blockers may comprise any number of nucleobase analogues (such as LNAs or BNAs), depending on the desired hybridization T_(m). For example, a blocker comprises 20 to 40 nucleobase analogues. In some instances, a blocker comprises 8 to 16 nucleobase analogues. In some instances, a blocker comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, or at least 12 nucleobase analogues. In some instances, a blocker comprises about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or about 16 nucleobase analogues. In some instances, the number of nucleobase analogues is expressed as a percent of the total bases in the blocker. For example, a blocker comprises at least 1%, 2%, 5%, 10%, 12%, 18%, 24%, 30%, or more than 30% nucleobase analogues. In some instances, the blocker comprising a nucleobase analogue raises the T_(m) in a range of about 2° C. to about 8° C. for each nucleobase analogue.
In some instances, the T_(m) is raised by at least or about 1° C., 2° C., 3° C., 4° C., 5° C., 6° C., 7° C., 8° C., 9° C., 10° C., 12° C., 14° C., or 16° C. for each nucleobase analogue. Such blockers in some instances are configured to bind to the top or “sense” strand of an adapter. Blockers in some instances are configured to bind to the bottom or “anti-sense” strand of an adapter. In some instances, a set of blockers includes sequences which are configured to bind to both top and bottom strands of an adapter. Additional blockers in some instances are configured to bind to the complement, reverse, forward, or reverse complement of an adapter sequence. In some instances, a set of blockers targeting a top (binding to the top) or bottom strand (or both) is designed and tested, followed by optimization, such as replacing a top blocker with a bottom blocker, or a bottom blocker with a top blocker. In some instances, a blocker is configured to overlap fully or partially with bases of an index or barcode on an adapter. A set of blockers in some instances comprises at least one blocker overlapping with an adapter index sequence. A set of blockers in some instances comprises at least one blocker overlapping with an adapter index sequence, and at least one blocker which does not overlap with an adapter index sequence. A set of blockers in some instances comprises at least one blocker which does not overlap with a yoke region sequence. A set of blockers in some instances comprises at least one blocker which does not overlap with a yoke region sequence and at least one blocker which overlaps with a yoke region sequence. A set of blockers in some instances comprises 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 blockers.
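As a rough illustration of the per-analogue T_(m) increase described above, the sketch below estimates a T_(m) range for a modified blocker. It assumes, purely for illustration, that each nucleobase analogue contributes independently and additively within the stated range of about 2° C. to about 8° C.; the function name, parameters, and the 60° C. starting value are hypothetical, and real T_(m) values depend on sequence context and buffer conditions.

```python
def estimated_tm_range(unmodified_tm, n_analogues, low_per_analogue=2.0, high_per_analogue=8.0):
    """Return (low, high) T_m estimates in degrees C for a blocker with n analogues.

    Assumes a simple additive model, which is an approximation and not a
    thermodynamic prediction.
    """
    if n_analogues < 0:
        raise ValueError("n_analogues must be non-negative")
    return (
        unmodified_tm + n_analogues * low_per_analogue,
        unmodified_tm + n_analogues * high_per_analogue,
    )


# Hypothetical unmodified blocker at 60 degrees C carrying 3 analogues
low_tm, high_tm = estimated_tm_range(60.0, 3)
```

Under this simplified model, three analogues would move a 60° C. blocker to somewhere between 66° C. and 84° C., which shows how a small number of analogues can bring a blocker into a target T_(m) window.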

Blockers may be any length, depending on the size of the adapter or hybridization T_(m). For example, blockers are 20 to 50 bases in length. In some instances, blockers are 25 to 45 bases, 30 to 40 bases, 20 to 40 bases, or 30 to 50 bases in length. In some instances, blockers are 25 to 35 bases in length. In some instances, blockers are at least 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or at least 35 bases in length. In some instances, blockers are no more than 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or no more than 35 bases in length. In some instances, blockers are about 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or about 35 bases in length. In some instances, blockers are about 50 bases in length. A set of blockers targeting an adapter-tagged genomic library fragment in some instances comprises blockers of more than one length. Two blockers are in some instances tethered together with a linker. Various linkers are well known in the art, and in some instances comprise alkyl groups, polyether groups, amine groups, amide groups, or other chemical groups. In some instances, linkers comprise individual linker units, which are connected together (or attached to blocker polynucleotides) through a backbone such as phosphate, thiophosphate, amide, or other backbone. In an exemplary arrangement, a linker spans the index region between a first blocker that targets the 5′ end of the adapter sequence and a second blocker that targets the 3′ end of the adapter sequence. In some instances, capping groups are added to the 5′ or 3′ end of the blocker to prevent downstream amplification. Capping groups variously comprise polyethers, polyalcohols, alkanes, or other non-hybridizable groups that prevent amplification. Such groups are in some instances connected through phosphate, thiophosphate, amide, or other backbone. In some instances, one or more blockers are used. In some instances, at least 4 non-identical blockers are used.
In some instances, a first blocker spans a first 3′ end of an adaptor sequence, a second blocker spans a first 5′ end of an adaptor sequence, a third blocker spans a second 3′ end of an adaptor sequence, and a fourth blocker spans a second 5′ end of an adaptor sequence. In some instances, a first blocker is at least 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or at least 35 bases in length. In some instances, a second blocker is at least 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or at least 35 bases in length. In some instances, a third blocker is at least 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or at least 35 bases in length. In some instances, a fourth blocker is at least 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or at least 35 bases in length. In some instances, a first blocker, second blocker, third blocker, or fourth blocker comprises a nucleobase analogue. In some instances, the nucleobase analogue is LNA.

The design of blockers may be influenced by the desired hybridization T_(m) to the adapter sequence. In some instances, non-canonical nucleic acids (for example locked nucleic acids, bridged nucleic acids, or other non-canonical nucleic acid or analog) are inserted into blockers to increase or decrease the blocker's T_(m). In some instances, the T_(m) of a blocker is calculated using a tool specific to calculating T_(m) for polynucleotides comprising a non-canonical nucleic acid. In some instances, a T_(m) is calculated using the Exiqon online prediction tool. In some instances, blocker T_(m) described herein are calculated in-silico. In some instances, the blocker T_(m) is calculated in-silico, and is correlated to experimental in-vitro conditions. Without being bound by theory, an experimentally determined T_(m) may be further influenced by experimental parameters such as salt concentration, temperature, presence of additives, or other factors. In some instances, T_(m) described herein are in-silico determined T_(m) that are used to design or optimize blocker performance. In some instances, T_(m) values are predicted, estimated, or determined from melting curve analysis experiments. In some instances, blockers have a T_(m) of 70 degrees C. to 99 degrees C. In some instances, blockers have a T_(m) of 75 degrees C. to 90 degrees C. In some instances, blockers have a T_(m) of at least 85 degrees C. In some instances, blockers have a T_(m) of at least 70, 72, 75, 77, 80, 82, 85, 88, 90, or at least 92 degrees C. In some instances, blockers have a T_(m) of about 70, 72, 75, 77, 80, 82, 85, 88, 90, 92, or about 95 degrees C. In some instances, blockers have a T_(m) of 78 degrees C. to 90 degrees C. In some instances, blockers have a T_(m) of 79 degrees C. to 90 degrees C. In some instances, blockers have a T_(m) of 80 degrees C. to 90 degrees C. In some instances, blockers have a T_(m) of 81 degrees C. to 90 degrees C. In some instances, blockers have a T_(m) of 82 degrees C. to 90 degrees C.
In some instances, blockers have a T_(m) of 83 degrees C. to 90 degrees C. In some instances, blockers have a T_(m) of 84 degrees C. to 90 degrees C. In some instances, a set of blockers has an average T_(m) of 78 degrees C. to 90 degrees C. In some instances, a set of blockers has an average T_(m) of 80 degrees C. to 90 degrees C. In some instances, a set of blockers has an average T_(m) of at least 80 degrees C. In some instances, a set of blockers has an average T_(m) of at least 81 degrees C. In some instances, a set of blockers has an average T_(m) of at least 82 degrees C. In some instances, a set of blockers has an average T_(m) of at least 83 degrees C. In some instances, a set of blockers has an average T_(m) of at least 84 degrees C. In some instances, a set of blockers has an average T_(m) of at least 86 degrees C. Blocker T_(m) are in some instances modified as a result of other components described herein, such as use of a fast hybridization buffer and/or hybridization enhancer.

The molar ratio of blockers to adapter targets may influence the off-bait (and subsequently off-target) rates during hybridization. The more efficient a blocker is at binding to the target adapter, the less blocker is required. Blockers described herein in some instances achieve sequencing outcomes of no more than 20% off-target reads with a molar ratio of less than 20:1 (blocker:target). In some instances, no more than 20% off-target reads are achieved with a molar ratio of less than 10:1 (blocker:target). In some instances, no more than 20% off-target reads are achieved with a molar ratio of less than 5:1 (blocker:target). In some instances, no more than 20% off-target reads are achieved with a molar ratio of less than 2:1 (blocker:target). In some instances, no more than 20% off-target reads are achieved with a molar ratio of less than 1.5:1 (blocker:target). In some instances, no more than 20% off-target reads are achieved with a molar ratio of less than 1.2:1 (blocker:target). In some instances, no more than 20% off-target reads are achieved with a molar ratio of less than 1.05:1 (blocker:target).

The universal blockers may be used with panel libraries of varying size. In some embodiments, the panel library comprises at least or about 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 1.0, 2.0, 4.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 40.0, 50.0, 60.0, or more than 60.0 megabases (Mb).

Blockers as described herein may improve on-target performance. In some embodiments, on-target performance is improved by at least or about 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more than 95%. In some embodiments, the on-target performance is improved by at least or about 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more than 95% for various index designs. In some embodiments, the on-target performance is improved by at least or about 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more than 95% for various panel sizes.

Hybridization Buffers

Any number of buffers may be used with the hybridization methods described herein. For example, a buffer comprises numerous chemical components, such as polymers, solvents, salts, surfactants, or other components. In some instances, hybridization buffers decrease the hybridization times (e.g., “fast” hybridization buffers) required to achieve a given sequencing result or level of quality. Such components in some instances lead to improved hybridization outcomes, such as increased on-target rate, improved sequencing outcomes (e.g., sequencing depth or other metric), or decreased off-target rates. Such components may be introduced at any concentration to achieve such outcomes. In some instances, buffer components are added in specific order. For example, water is added first. In some instances, salts are added after water. In some instances, salts are added after thickening agents and surfactants. In some instances, hybridization buffers such as “fast” hybridization buffers described herein are used in conjunction with universal blockers and liquid polymer additives. In some instances, use of fast hybridization buffers reduces hybridization times to no more than 4, 3, 2, 1, 0.5, 0.2, or 0.1 hours.

Hybridization buffers described herein may comprise solvents, or mixtures of two or more solvents. In some instances, a hybridization buffer comprises a mixture of two solvents, three solvents or more than three solvents. In some instances, a hybridization buffer comprises a mixture of an alcohol and water. In some instances, a hybridization buffer comprises a mixture of a ketone-containing solvent and water. In some instances, a hybridization buffer comprises a mixture of an ethereal solvent and water. In some instances, a hybridization buffer comprises a mixture of a sulfoxide-containing solvent and water. In some instances, a hybridization buffer comprises a mixture of an amide-containing solvent and water. In some instances, a hybridization buffer comprises a mixture of an ester-containing solvent and water. In some instances, hybridization buffers comprise solvents such as water, ethanol, methanol, propanol, butanol, other alcohol solvent, or a mixture thereof. In some instances, hybridization buffers comprise solvents such as acetone, methyl ethyl ketone, 2-butanone, ethyl acetate, methyl acetate, tetrahydrofuran, diethyl ether, or a mixture thereof. In some instances, hybridization buffers comprise solvents such as DMSO, DMF, DMA, HMPA, or a mixture thereof. In some instances, hybridization buffers comprise a mixture of water, HMPA, and an alcohol. In some instances, two solvents are present at a 1:1, 1:2, 1:3, 1:4, 1:5, 1:8, 1:9, 1:10, 1:20, 1:50, 1:100, or 1:500 ratio.

Hybridization buffers described herein may comprise polymers. Polymers include but are not limited to thickening agents, polymeric solvents, dielectric materials, or other polymer. Polymers are in some instances hydrophobic or hydrophilic. In some instances, polymers are silicon polymers. In some instances, polymers comprise repeating polyethylene or polypropylene units, or a mixture thereof. In some instances, polymers comprise polyvinylpyrrolidone or polyvinylpyridine. In some instances, polymers comprise amino acids. For example, in some instances polymers comprise proteins. In some instances, polymers comprise casein, milk proteins, bovine serum albumin, or other protein. In some instances, polymers comprise nucleotides, for example, DNA or RNA. In some instances, polymers comprise polyA, polyT, Cot-1 DNA, or other nucleic acid. In some instances, polymers comprise sugars. For example, in some instances a polymer comprises glucose, arabinose, galactose, mannose, or other sugar. In some instances, a polymer comprises cellulose or starch. In some instances, a polymer comprises agar, carboxyalkyl cellulose, xanthan, guar gum, locust bean gum, gum karaya, gum tragacanth, or gum arabic. In some instances, a polymer comprises a derivative of cellulose or starch, or nitrocellulose, dextran, hydroxyethyl starch, ficoll, or a combination thereof. In some instances, mixtures of polymers are used in hybridization buffers described herein. In some instances, hybridization buffers comprise Denhardt's solution. Polymers described herein may be present at any concentration suitable for reducing off-target binding. Such concentrations are often represented as a percent by weight, percent by volume, or percent weight per volume. For example, a polymer is present at about 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or about 30%.
In some instances, a polymer is present at no more than 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or no more than 30%. In some instances, a polymer is present in at least 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or at least 30%. In some instances, a polymer is present at 0.0001%-10%, 0.0002%-5%, 0.0005%-1.5%, 0.0008%-1%, 0.001%-0.2%, 0.002%-0.08%, 0.005%-0.02%, or 0.008%-0.05%. In some instances, a polymer is present at 0.005%-0.1%. In some instances, a polymer is present at 0.05%-0.1%. In some instances, a polymer is present at 0.005%-0.6%. In some instances, a polymer is present at 1%-30%, 5%-25%, 10%-30%, 15%-30%, or 1%-15%. Liquid polymers may be present as a percentage of the total reaction volume. In some instances, a polymer is about 10%, 20%, 30%, 40%, 50%, 60%, 75%, or about 90% of the total volume. In some instances, a polymer is at least 10%, 20%, 30%, 40%, 50%, 60%, 75%, or at least 90% of the total volume. In some instances, a polymer is no more than 10%, 20%, 30%, 40%, 50%, 60%, 75%, or no more than 90% of the total volume. In some instances, a polymer is 5%-75%, 5%-65%, 5%-55%, 10%-50%, 15%-40%, 20%-50%, 20%-30%, 25%-35%, 5%-35%, 10%-35%, or 20%-40% of the total volume. In some instances, a polymer is 25%-45% of the total volume. In some instances, hybridization buffers described herein are used in conjunction with universal blockers and liquid polymer additives.

Hybridization buffers described herein may comprise salts such as cations or anions. For example, a hybridization buffer comprises a monovalent or divalent cation. In some instances, a hybridization buffer comprises a monovalent or divalent anion. Cations in some instances comprise sodium, potassium, magnesium, lithium, tris, or other salt. Anions in some instances comprise sulfate, bisulfate, hydrogensulfate, nitrate, chloride, bromide, citrate, ethylenediaminetetraacetate, dihydrogenphosphate, hydrogenphosphate, or phosphate. In some instances, hybridization buffers comprise salts comprising any combination of anions and cations (e.g., sodium chloride, sodium sulfate, potassium phosphate, or other salt). In some instances, a hybridization buffer comprises an ionic liquid. Salts described herein may be present at any concentration suitable for reducing off-target binding. Such concentrations are often represented as a percent by weight, percent by volume, or percent weight per volume. For example, a salt is present at about 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or about 30%. In some instances, a salt is present at no more than 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or no more than 30%. In some instances, a salt is present in at least 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or at least 30%. In some instances, a salt is present at 0.0001%-10%, 0.0002%-5%, 0.0005%-1.5%, 0.0008%-1%, 0.001%-0.2%, 0.002%-0.08%, 0.005%-0.02%, or 0.008%-0.05%. In some instances, a salt is present at 0.005%-0.1%. In some instances, a salt is present at 0.05%-0.1%. In some instances, a salt is present at 0.005%-0.6%.
In some instances, a salt is present at 1%-30%, 5%-25%, 10%-30%, 15%-30%, or 1%-15%. Liquid polymers may be present as a percentage of the total reaction volume. In some instances, a salt is about 10%, 20%, 30%, 40%, 50%, 60%, 75%, or about 90% of the total volume. In some instances, a salt is at least 10%, 20%, 30%, 40%, 50%, 60%, 75%, or at least 90% of the total volume. In some instances, a salt is no more than 10%, 20%, 30%, 40%, 50%, 60%, 75%, or no more than 90% of the total volume. In some instances, a salt is 5%-75%, 5%-65%, 5%-55%, 10%-50%, 15%-40%, 20%-50%, 20%-30%, 25%-35%, 5%-35%, 10%-35%, or 20%-40% of the total volume. In some instances, a salt is 25%-45% of the total volume.

Hybridization buffers described herein may comprise surfactants (or emulsifiers). For example, a hybridization buffer comprises SDS (sodium dodecyl sulfate), CTAB, cetylpyridinium, benzalkonium, tergitol, fatty acid sulfonates (e.g., sodium lauryl sulfate), ethoxylated propylene glycol, lignin sulfonates, benzene sulfonate, lecithin, phospholipids, dialkyl sulfosuccinates (e.g., dioctyl sodium sulfosuccinate), glycerol diester, polyethoxylated octyl phenol, abietic acid, sorbitan monoester, perfluoro alkanols, sulfonated polystyrene, betaines, dimethyl polysiloxanes, or other surfactant. In some instances, a hybridization buffer comprises a sulfate, phosphate, or tetraalkyl ammonium group. Surfactants described herein may be present at any concentration suitable for reducing off-target binding. Such concentrations are often represented as a percent by weight, percent by volume, or percent weight per volume. For example, a surfactant is present at about 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or about 30%. In some instances, a surfactant is present at no more than 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or no more than 30%. In some instances, a surfactant is present in at least 0.0001%, 0.0002%, 0.0005%, 0.0008%, 0.001%, 0.002%, 0.005%, 0.008%, 0.01%, 0.02%, 0.05%, 0.08%, 0.1%, 0.2%, 0.5%, 0.8%, 1%, 1.2%, 1.5%, 1.8%, 2%, 5%, 10%, 20%, or at least 30%. In some instances, a surfactant is present at 0.0001%-10%, 0.0002%-5%, 0.0005%-1.5%, 0.0008%-1%, 0.001%-0.2%, 0.002%-0.08%, 0.005%-0.02%, or 0.008%-0.05%. In some instances, a surfactant is present at 0.005%-0.1%. In some instances, a surfactant is present at 0.05%-0.1%. In some instances, a surfactant is present at 0.005%-0.6%.
In some instances, a surfactant is present at 1%-30%, 5%-25%, 10%-30%, 15%-30%, or 1%-15%. Liquid polymers may be present as a percentage of the total reaction volume. In some instances, a surfactant is about 10%, 20%, 30%, 40%, 50%, 60%, 75%, or about 90% of the total volume. In some instances, a surfactant is at least 10%, 20%, 30%, 40%, 50%, 60%, 75%, or at least 90% of the total volume. In some instances, a surfactant is no more than 10%, 20%, 30%, 40%, 50%, 60%, 75%, or no more than 90% of the total volume. In some instances, a surfactant is 5%-75%, 5%-65%, 5%-55%, 10%-50%, 15%-40%, 20%-50%, 20%-30%, 25%-35%, 5%-35%, 10%-35%, or 20%-40% of the total volume. In some instances, a surfactant is 25%-45% of the total volume.

Buffers used in the methods described herein may comprise any combination of components. In some instances, a buffer described herein is a hybridization buffer. In some instances, a hybridization buffer described herein is a fast hybridization buffer. Such fast hybridization buffers allow for lower hybridization times such as less than 8 hours, 6 hours, 4 hours, 2 hours, 1 hour, 45 minutes, 30 minutes, or less than 15 minutes. Hybridization buffers described herein in some instances comprise a buffer described in Tables 2A-2G. In some instances, the buffers described in Tables 1A-1I may be used as fast hybridization buffers. In some instances, the buffers described in Tables 1B, 1C, and 1D may be used as fast hybridization buffers. In some instances, a fast hybridization buffer as described herein is described in Table 1B. In some instances, a fast hybridization buffer as described herein is described in Table 1C. In some instances, a fast hybridization buffer as described herein is described in Table 1D.

TABLE 2A Buffers A
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         5-300         Water                         100-300
DMF                           0-3           DMSO                          0-3
NaCl (5M)                     0.01-0.5      NaCl (5M)                     0.01-0.5
20% SDS                       0.05-0.5      20% SDS                       0.05-0.5
Tergitol (1% by weight)       0.2-3         EDTA (1M)                     0-2
Denhardt’s Solution (50X)     1-10          Denhardt’s Solution (50X)     1-10
NaH₂PO₄ (5M)                  0.01-1.5      NaH₂PO₄ (5M)                  0.01-1.5

TABLE 2B Buffers B
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         5-30          Water                         5-30
DMSO                          0.5-3         DMSO                          0.5-3
NaCl (5M)                     0.01-0.5      NaCl (5M)                     0.01-0.5
20% SDS                       0.05-0.5      20% CTAB                      0.05-0.5
EDTA (1M)                     0.05-2        EDTA (1M)                     0.05-2
Denhardt’s Solution (50X)     1-10          Denhardt’s Solution (50X)     1-10
NaH₂PO₄ (5M)                  0.01-1.5      NaH₂PO₄ (5M)                  0.01-1.5

TABLE 2C Buffers C
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         5-30          Water                         5-30
DMSO                          0.5-3         DMSO                          0.5-3
NaCl (1M)                     0.01-0.5      NaCl (5M)                     0.01-0.5
20% SDS                       0.05-0.5      20% SDS                       0.05-0.5
TrisHCl (1M)                  0.01-2.5      Dextran Sulfate (50%)         0.05-2
Denhardt’s Solution (50X)     1-10          Denhardt’s Solution (50X)     1-10
NaH₂PO₄ (5M)                  0.01-1.5      NaH₂PO₄ (5M)                  0.01-1.5
EDTA (0.5 M)                  0.05-1.5      EDTA (0.5 M)                  0.05-1.5

TABLE 2D Buffers D
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         5-30          Water                         5-30
Methanol                      0.1-3         DMSO                          0.5-3
NaCl (1M)                     0.01-0.5      NaCl (5M)                     0.01-0.5
20% Dextran Sulfate           0.05-0.5      20% SDS                       0.05-0.5
TrisHCl (1M)                  0.01-2.5      hydroxyethyl starch (20%)     0.05-2
Denhardt’s Solution (50X)     1-10          Denhardt’s Solution (50X)     1-10
NaH₂PO₄ (1M)                  0.01-1.5      NaH₂PO₄ (5M)                  0.01-1.5
EDTA (0.5 M)                  0.05-1.5      EDTA (0.5 M)                  0.05-1.5

TABLE 2E Buffers E
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         5-300         Water                         5-300
DMF                           0.1-30        DMSO                          0.5-30
NaCl (1M)                     0.01-0.5      NaCl (5M)                     0.01-1.0
hydroxyethyl starch (20%)     0.01-2.5      hydroxyethyl starch (20%)     0.01-2.5
Denhardt’s Solution (50X)     1-10          Denhardt’s Solution (50X)     0.05-2
NaH₂PO₄ (1M)                  0.01-1.5      NaH₂PO₄ (5M)                  1-10

TABLE 2F Buffers F
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         50-300        Water                         50-300
DMF                           15-300        DMSO                          15-300
NaCl (5M)                     2-100         NaCl (5M)                     2-100
Denhardt’s Solution (50X)     1-10          saline-sodium citrate 20X     1-50
Tergitol (1% by weight)       0.2-2.0       20% SDS                       0-2

TABLE 2G Buffers G
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         5-30          Water                         5-30
Ethanol                       0-3           Methanol                      0-3
NaCl (1M)                     0.01-0.5      NaCl (5M)                     0.01-0.5
NaH₂PO₄ (5M)                  0.01-1.5      NaH₂PO₄ (5M)                  0-2
EDTA (0.5 M)                  0-1.5         EDTA (0.5 M)                  1-10

TABLE 2H Buffers H
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         50-300        Water                         10-300
EDTA (0.5 M)                  0-1.5         NaCl (5M)                     0.01-0.5
NaCl (5M)                     5-70          10% Triton X-100              0.05-0.5
Tergitol (1% by weight)       0.2-2.0       EDTA (1M)                     0-2
TrisHCl (1M)                  0.01-2.5      TrisHCl (1M)                  0.1-5

TABLE 2I Buffers I
Buffer Component              Volume (mL)   Buffer Component              Volume (mL)
Water                         5-200         Water                         10-200
EDTA (0.5 M)                  0-1.5         NaCl (5M)                     0.01-0.5
NaCl (5M)                     5-100         Sodium Lauryl sulfate (10%)   0.05-0.5
CTAB (0.2M)                   0.05-0.5      EDTA (1M)                     0-2
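The volume ranges listed in the buffer tables lend themselves to a simple programmatic check. The following sketch encodes the first buffer of Table 2B and reports any component of a proposed recipe that falls outside its listed range. The dictionary layout, the function name, and the example recipe are hypothetical and are shown for illustration only.

```python
# Volume ranges (mL) for the first buffer of Table 2B.
TABLE_2B_FIRST = {
    "Water": (5.0, 30.0),
    "DMSO": (0.5, 3.0),
    "NaCl (5M)": (0.01, 0.5),
    "20% SDS": (0.05, 0.5),
    "EDTA (1M)": (0.05, 2.0),
    "Denhardt's Solution (50X)": (1.0, 10.0),
    "NaH2PO4 (5M)": (0.01, 1.5),
}


def out_of_range(recipe_ml, ranges=TABLE_2B_FIRST):
    """Return {component: volume} for every entry outside the listed range.

    Components missing from the recipe are treated as 0 mL, and components
    not listed in `ranges` are always reported.
    """
    problems = {}
    for component, (low, high) in ranges.items():
        volume = recipe_ml.get(component, 0.0)
        if not low <= volume <= high:
            problems[component] = volume
    for component in recipe_ml.keys() - ranges.keys():
        problems[component] = recipe_ml[component]
    return problems


if __name__ == "__main__":
    recipe = {  # hypothetical recipe, volumes in mL
        "Water": 20.0, "DMSO": 5.0, "NaCl (5M)": 0.2, "20% SDS": 0.1,
        "EDTA (1M)": 0.5, "Denhardt's Solution (50X)": 2.0,
        "NaH2PO4 (5M)": 0.5,
    }
    print(out_of_range(recipe))  # only DMSO exceeds its 0.5-3 mL range
```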

Buffers such as binding buffers and wash buffers are described herein. Binding buffers in some instances are used to prepare mixtures of sample polynucleotides and probes after hybridization. In some instances, binding buffers facilitate capture of sample polynucleotides on a column or other solid support. In some instances, the buffers described in Tables 2A-2I may be used as binding buffers. Binding buffers in some instances comprise a buffer described in Tables 2A, 2H, and 2I. In some instances, a binding buffer as described herein is described in Table 2A. In some instances, a binding buffer as described herein is described in Table 2H. In some instances, a binding buffer as described herein is described in Table 2I. In some instances, the buffers described herein may be used as wash buffers. Wash buffers in some instances are used to remove non-binding polynucleotides from a column or solid support. In some instances, the buffers described in Tables 2A-2I may be used as wash buffers. In some instances, a wash buffer comprises a buffer as described in Tables 2E, 2F, and 2G. In some instances, a wash buffer as described herein is described in Table 2E. In some instances, a wash buffer as described herein is described in Table 2F. In some instances, a wash buffer as described herein is described in Table 2G. Wash buffers used with the compositions and methods described herein are in some instances described as a first wash buffer (wash buffer 1), second wash buffer (wash buffer 2), etc.

Methods for Sequencing

Described herein are methods to improve the efficiency and accuracy of sequencing. Such methods comprise use of universal adapters comprising nucleobase analogues, and generation of barcoded adapters after ligation to sample nucleic acids. In some instances, a sample is fragmented, fragment ends are repaired, one or more adenines is added to one strand of a fragment duplex, universal adapters are ligated, and a library of fragments is amplified with barcoded primers to generate a barcoded nucleic acid library. Additional steps in some instances include enrichment/capture, additional PCR amplification, and/or sequencing of the nucleic acid library.

In a first step of an exemplary sequencing workflow (FIG. 9), a sample 208 comprising sample nucleic acids is fragmented by mechanical or enzymatic shearing to form a library of fragments 209. Universal adapters 220 are ligated to fragmented sample nucleic acids to form an adapter-ligated sample nucleic acid library 221. This library is then amplified with a barcoded primer library 222 (only one primer shown for simplicity) to generate a barcoded adapter-sample polynucleotide library 223. The library 223 is then optionally hybridized with target binding polynucleotides 217, which hybridize to sample nucleic acids, along with blocking polynucleotides 216 that prevent hybridization between probe polynucleotides 217 and adapters 220. Capture of sample polynucleotide-target binding polynucleotide hybridization pairs 212/218, and removal of target binding polynucleotides 217 allows isolation/enrichment of sample nucleic acids 213, which are then optionally amplified and sequenced 214. Various combinations of universal adapters and barcoded primers may be used. In some instances, barcoded primers comprise at least one barcode. In some instances, different types of barcodes are added to the sample nucleic acid using adapters or barcodes, or both. For example, a universal adapter comprises an index barcode, and after ligation is amplified with a barcoded primer comprising an additional index barcode. In some instances, a universal adapter comprises a unique molecular identifier barcode, and after ligation is amplified with a barcoded primer comprising an index barcode.

Barcoded primers may be used to amplify universal adapter-ligated sample polynucleotides using PCR, to generate a polynucleic acid library for sequencing. Such a library comprises barcodes after amplification in some instances. In some instances, amplification with barcoded primers results in higher amplification yields relative to amplification of a standard Y adapter-ligated sample polynucleotide library. In some instances, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 PCR cycles are used to amplify a universal adapter-ligated sample polynucleotide library. In some instances, no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or no more than 12 PCR cycles are used to amplify a universal adapter-ligated sample polynucleotide library. In some instances, 2-12, 3-10, 4-9, 5-8, 6-10, or 8-12 PCR cycles are used to amplify a universal adapter-ligated sample polynucleotide library, thus generating amplicon products. Such libraries in some instances comprise fewer PCR-based errors. Without being bound by theory, reduced PCR cycles during amplification lead to fewer errors in resulting amplicon products. After amplification, such barcoded amplicon libraries are in some instances enriched or subjected to capture, additional amplification reactions, and/or sequencing. In some instances, amplicon products generated using the universal adapters described herein comprise about 30%, 15%, 10%, 7%, 5%, 3%, 2%, 1.5%, 1%, 0.5%, 0.1%, or 0.05% fewer errors than amplicon products generated from amplification of standard full-length Y adapters.

Described herein are methods wherein universal blockers are used to prevent off-target binding of capture probes to adapters ligated to genomic fragments, or adapter-adapter hybridization. Adapter blockers used for preventing off-target hybridization may target a portion or the entire adapter. In some instances, specific blockers are used that are complementary to a portion of the adapter that includes the unique index sequence. In cases where the adapter-tagged genomic library comprises a large number of different indices, it can be beneficial to design blockers which either do not target the index sequence, or do not hybridize strongly to it. For example, a “universal” blocker targets a portion of the adapter that does not comprise an index sequence (index independent), which allows a minimum number of blockers to be used regardless of the number of different index sequences employed. In some instances, no more than 8 universal blockers are used. In some instances, 4 universal blockers are used. In some instances, 3 universal blockers are used. In some instances, 2 universal blockers are used. In some instances, 1 universal blocker is used. In an exemplary arrangement, 4 universal blockers are used with adapters comprising at least 4, 8, 16, 32, 64, 96, or at least 128 different index sequences. In some instances, the different index sequences comprise at least or about 4, 6, 8, 10, 12, 14, 16, 18, 20, or more than 20 base pairs (bp). In some instances, a universal blocker is not configured to bind to a barcode sequence. In some instances, a universal blocker partially binds to a barcode sequence. In some instances, a universal blocker which partially binds to a barcode sequence further comprises nucleotide analogs, such as those that increase the T_(m) of binding to the adapter (e.g., LNAs or BNAs).

Provided herein are methods of sequencing nucleic acids with UMIs. In some instances, a method comprises one or more of ligating one or more polynucleotide adapters to a plurality of sample nucleic acids to generate a library of adapter-ligated sample polynucleotides, wherein at least some of the polynucleotide adapters comprise a unique molecular identifier; amplifying the library; sequencing the enriched library to generate a plurality of reads; and organizing the reads based on the unique molecular identifier to distinguish between amplification errors and single nucleotide polymorphisms present in the sample nucleic acids. In some instances, the polynucleotide adapters comprise a first unique molecular identifier and a second unique molecular identifier. Adapters comprising UMIs described herein and methods of use in sequencing thereof may increase various metrics of sequencing efficiency. In some instances, UMIs increase accuracy of recall of base calling. In some instances, UMIs described herein comprise increased duplex efficiency. In some instances, duplex efficiency comprises the number of duplex reads remaining after duplex collapse divided by the total number of input reads. In some instances, adapters comprising UMIs comprise a duplex efficiency of at least 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, or more. In some instances, adapters comprising UMIs comprise a duplex efficiency of 1%-5%, 1-10%, 1-8%, 2-6%, 2-10%, 3-5%, 3-8%, or 4-10%. In some instances, a method of using adapters comprising UMIs results in a recall of at least 20% for single nucleotide polymorphisms present at least at 0.2% abundance in the sample nucleic acids. In some instances, a method of using adapters comprising UMIs results in a recall of at least 20% for single nucleotide polymorphisms present at least at 0.5% abundance in the sample nucleic acids.
In some instances, a method of using adapters comprising UMIs results in a recall of at least 20% for single nucleotide polymorphisms present at least at 1% abundance in the sample nucleic acids.
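One way to picture how reads are organized by UMI is the following minimal sketch, which groups reads into families by their pair of UMIs, takes a per-position majority vote within each strand, and computes a duplex efficiency as duplex families divided by input reads. Several points are assumptions for illustration and are not specified by this disclosure: reads are given as (first UMI, second UMI, strand, sequence) tuples already oriented to the same reference strand and of equal length, the two strands of a molecule are assumed to carry the same UMI pair in swapped order, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Majority base at each position across equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def collapse(reads):
    """Group reads by unordered UMI pair, then by strand.

    Returns {umi_pair: {strand: single-strand consensus}}.
    """
    families = defaultdict(lambda: defaultdict(list))
    for umi1, umi2, strand, seq in reads:
        key = tuple(sorted((umi1, umi2)))  # assumed strand-swap convention
        families[key][strand].append(seq)
    return {key: {s: consensus(v) for s, v in strands.items()}
            for key, strands in families.items()}


def duplex_efficiency(reads):
    """Families supported by both strands divided by total input reads."""
    reads = list(reads)
    if not reads:
        return 0.0
    collapsed = collapse(reads)
    duplex = sum(1 for strands in collapsed.values()
                 if {"+", "-"}.issubset(strands))
    return duplex / len(reads)


if __name__ == "__main__":
    reads = [
        ("AAC", "GTT", "+", "ACGT"),
        ("AAC", "GTT", "+", "ACGT"),
        ("AAC", "GTT", "+", "ACTT"),  # minority base, voted out
        ("GTT", "AAC", "-", "ACGT"),
        ("CCA", "TGG", "+", "ACGA"),  # single-strand family only
    ]
    print(collapse(reads))
    print(duplex_efficiency(reads))
```

In this sketch, a base change seen in only a minority of reads within a family is treated as an amplification or sequencing error, whereas a change present in the consensus of both strands is a candidate variant.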

Adapters comprising UMIs described herein and methods of use thereof may increase efficiency of SNV calling. In some instances, at least 80% of SNV variants are called at a level of at least 0.5%. In some instances, at least 80% of SNV variants are called at a level of at least 1%. In some instances, at least 80% of SNV variants are called at a level of at least 2%. In some instances, at least 80% of SNV variants are called at a level of at least 1.5%. In some instances, at least 90% of SNV variants are called at a level of at least 1%. In some instances, at least 95% of SNV variants are called at a level of at least 1%. In some instances, at least 95% of SNV variants are called at a level of at least 0.5%. In some instances, at least 80% of SNV variants are called at a level of at least 0.5% with a minimum sequencing depth of 10,000×. In some instances, at least 80% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 10,000×. In some instances, at least 80% of SNV variants are called at a level of at least 2% with a minimum sequencing depth of 10,000×. In some instances, at least 80% of SNV variants are called at a level of at least 1.5% with a minimum sequencing depth of 10,000×. In some instances, at least 90% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 10,000×. In some instances, at least 95% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 10,000×. In some instances, at least 95% of SNV variants are called at a level of at least 0.5% with a minimum sequencing depth of 10,000×. In some instances, at least 80% of SNV variants are called at a level of at least 0.5% with a minimum sequencing depth of 20,000×. In some instances, at least 80% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 20,000×. In some instances, at least 80% of SNV variants are called at a level of at least 2% with a minimum sequencing depth of 20,000×.
In some instances, at least 80% of SNV variants are called at a level of at least 1.5% with a minimum sequencing depth of 20,000×. In some instances, at least 90% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 20,000×. In some instances, at least 95% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 20,000×. In some instances, at least 95% of SNV variants are called at a level of at least 0.5% with a minimum sequencing depth of 20,000×.
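The calling rates above can be read as a recall measured over known variants present at or above a given level and covered at or above a minimum depth. The following sketch computes such a recall under that reading, which is an interpretation adopted for illustration; the function name, the use of string variant keys, and the example values are hypothetical.

```python
def snv_recall(truth_level, depth, called, min_level=0.01, min_depth=10_000):
    """Fraction of eligible known SNVs that were called.

    truth_level: {variant: known allele fraction in the sample}
    depth:       {variant: sequencing depth at the variant site}
    called:      iterable of variants reported by the caller
    A known variant is eligible when its level is >= min_level and its
    site depth is >= min_depth. Returns None if nothing is eligible.
    """
    eligible = {v for v, level in truth_level.items()
                if level >= min_level and depth.get(v, 0) >= min_depth}
    if not eligible:
        return None
    return len(eligible & set(called)) / len(eligible)


if __name__ == "__main__":
    truth = {"chr1:100A>G": 0.01, "chr1:200C>T": 0.02,
             "chr2:50G>A": 0.005, "chr2:80T>C": 0.01}
    depth = {"chr1:100A>G": 12_000, "chr1:200C>T": 12_000,
             "chr2:50G>A": 15_000, "chr2:80T>C": 8_000}
    called = {"chr1:100A>G", "chr2:50G>A"}
    print(snv_recall(truth, depth, called))                   # level >= 1%
    print(snv_recall(truth, depth, called, min_level=0.005))  # level >= 0.5%
```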

Methylation Sequencing and Capture

Methylation sequencing involves enzymatic or chemical methods leading to the conversion of unmethylated cytosines to uracil through a series of events culminating in deamination, while leaving methylated cytosines intact. During amplification, uracils are paired with adenines on the complementary strand, leading to the inclusion of thymine in the original position of the unmethylated cytosine. There are identical sequences with each having unmethylated cytosines in different positions. The end product is asymmetric, yielding two different double-stranded DNA molecules after conversion; the same process for methylated DNA leads to yet additional sets of sequences.
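The asymmetry described above can be illustrated in silico. The following minimal sketch converts unmethylated cytosines to thymine (the base read after uracil is copied during amplification) on each original strand independently and then lists the four resulting strands. The function names and the use of 0-based positions on each strand to mark methylated cytosines are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def convert(strand, methylated=frozenset()):
    """Read unmethylated C as T; positions in `methylated` stay C."""
    return "".join("T" if base == "C" and i not in methylated else base
                   for i, base in enumerate(strand))


def converted_strands(top, top_methylated=frozenset(),
                      bottom_methylated=frozenset()):
    """Return the four strands present after conversion and amplification:
    converted top, its complement, converted bottom, its complement."""
    bottom = revcomp(top)
    conv_top = convert(top, top_methylated)
    conv_bottom = convert(bottom, bottom_methylated)
    return [conv_top, revcomp(conv_top), conv_bottom, revcomp(conv_bottom)]


if __name__ == "__main__":
    for strand in converted_strands("ACGTCG"):
        print(strand)
    # Same locus with the cytosine at index 1 of the top strand methylated.
    print(converted_strands("ACGTCG", top_methylated={1}))
```

Because the converted top and bottom strands are no longer complementary to each other, each gives rise to its own double-stranded product, and a different methylation state yields a further set of sequences.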

Target enrichment can proceed by pre- or post-capture conversion. Post-capture conversion targets the original sample DNA, while pre-capture targets the four strands of converted sequences. While post-capture conversion presents fewer challenges for probe design, it often requires large quantities of starting DNA material as PCR amplification does not preserve methylation patterns and cannot be performed before capture. Therefore, pre-capture conversion is often the method of choice for low-input, sensitive applications such as cell-free DNA.

Methods described herein may comprise treatment of a library with enzymes or bisulfite to facilitate conversion of cytosines to uracil. In some instances, adapters (e.g., universal adapters) described herein comprise methylated nucleobases, such as methylated cytosine.

De Novo Synthesis of Small Polynucleotide Populations for Amplification Reactions

Described herein are methods of synthesis of polynucleotides from a surface, e.g., a plate (FIG. 10). In some instances, the polynucleotides are synthesized on a cluster of loci for polynucleotide extension, released and then subsequently subjected to an amplification reaction, e.g., PCR. An exemplary workflow of synthesis of polynucleotides from a cluster is depicted in FIG. 10. A silicon plate 1001 includes multiple clusters 1003. Within each cluster are multiple loci 1021. Polynucleotides are synthesized 1007 de novo on a plate 1001 from the cluster 1003. Polynucleotides are cleaved 1011 and removed 1013 from the plate to form a population of released polynucleotides 1015. The population of released polynucleotides 1015 is then amplified 1017 to form a library of amplified polynucleotides 1019.

Provided herein are methods where amplification of polynucleotides synthesized on a cluster provides for enhanced control over polynucleotide representation compared to amplification of polynucleotides across an entire surface of a structure without such a clustered arrangement. In some instances, amplification of polynucleotides synthesized from a surface having a clustered arrangement of loci for polynucleotide extension provides for overcoming the negative effects on representation due to repeated synthesis of large polynucleotide populations. Exemplary negative effects on representation due to repeated synthesis of large polynucleotide populations include, without limitation, amplification bias resulting from high/low GC content, repeating sequences, trailing adenines, secondary structure, affinity for target sequence binding, or modified nucleotides in the polynucleotide sequence.

Cluster amplification as opposed to amplification of polynucleotides across an entire plate without a clustered arrangement can result in a tighter distribution around the mean. For example, if 100,000 reads are randomly sampled, an average of 8 reads per sequence would yield a library with a distribution of about 1.5× from the mean. In some cases, single cluster amplification results in at most about 1.5×, 1.6×, 1.7×, 1.8×, 1.9×, or 2.0× from the mean. In some cases, single cluster amplification results in at least about 1.0×, 1.2×, 1.3×, 1.5×, 1.6×, 1.7×, 1.8×, 1.9×, or 2.0× from the mean.

Cluster amplification methods described herein when compared to amplification across a plate can result in a polynucleotide library that requires less sequencing for equivalent sequence representation. In some instances at least 10%, at least 20%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or at least 95% less sequencing is required. In some instances up to 10%, up to 20%, up to 30%, up to 40%, up to 50%, up to 60%, up to 70%, up to 80%, up to 90%, or up to 95% less sequencing is required. Sometimes 30% less sequencing is required following cluster amplification compared to amplification across a plate. Sequencing of polynucleotides in some instances is verified by high-throughput sequencing such as by next generation sequencing. Sequencing of the sequencing library can be performed with any appropriate sequencing technology, including but not limited to single-molecule real-time (SMRT) sequencing, polony sequencing, sequencing by ligation, reversible terminator sequencing, proton detection sequencing, ion semiconductor sequencing, nanopore sequencing, electronic sequencing, pyrosequencing, Maxam-Gilbert sequencing, chain termination (e.g., Sanger) sequencing, +S sequencing, or sequencing by synthesis. The number of times a single nucleotide or polynucleotide is identified or “read” is defined as the sequencing depth or read depth. In some cases, the read depth is referred to as a fold coverage, for example, 55 fold (or 55×) coverage, optionally describing a percentage of bases.
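Read depth and the percentage of bases at a given fold coverage, as defined above, can be computed from per-base depth values. The following is a minimal sketch; the function names and the example depth values are hypothetical.

```python
def mean_depth(per_base_depth):
    """Average number of times each base position was read."""
    depths = list(per_base_depth)
    if not depths:
        raise ValueError("no positions supplied")
    return sum(depths) / len(depths)


def fraction_at_or_above(per_base_depth, fold):
    """Fraction of base positions read at least `fold` times."""
    depths = list(per_base_depth)
    if not depths:
        raise ValueError("no positions supplied")
    return sum(1 for d in depths if d >= fold) / len(depths)


if __name__ == "__main__":
    depths = [60, 55, 40, 70, 55]  # illustrative per-base read counts
    print(mean_depth(depths))
    print(fraction_at_or_above(depths, 55))
```

With these example values, the mean depth is 56× and 80% of bases are covered at 55× or more.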

In some instances, amplification from a clustered arrangement compared to amplification across a plate results in fewer dropouts, or sequences which are not detected after sequencing of amplification product. Dropouts can be of AT and/or GC. In some instances, the number of dropouts is at most about 1%, 2%, 3%, 4%, or 5% of a polynucleotide population. In some cases, the number of dropouts is zero.

A cluster as described herein comprises a collection of discrete,non-overlapping loci for polynucleotide synthesis. A cluster cancomprise about 50-1000, 75-900, 100-800, 125-700, 150-600, 200-500, or300-400 loci. In some instances, each cluster includes 121 loci. In someinstances, each cluster includes about 50-500, 50-200, 100-150 loci. Insome instances, each cluster includes at least about 50, 100, 150, 200,500, 1000 or more loci. In some instances, a single plate includes 100,500, 10000, 20000, 30000, 50000, 100000, 500000, 700000, 1000000 or moreloci. A locus can be a spot, well, microwell, channel, or post. In someinstances, each cluster has at least 1×, 2×, 3×, 4×, 5×, 6×, 7×, 8×, 9×,10×, or more redundancy of separate features supporting extension ofpolynucleotides having identical sequence.

Generation of Polynucleotide Libraries with Controlled Stoichiometry of Sequence Content

In some instances, the polynucleotide library is synthesized with a specified distribution of desired polynucleotide sequences. In some instances, adjusting polynucleotide libraries for enrichment of specific desired sequences results in improved downstream application outcomes.

One or more specific sequences can be selected based on their evaluation in a downstream application. In some instances, the evaluation is binding affinity to target sequences for amplification, enrichment, or detection, stability, melting temperature, biological activity, ability to assemble into larger fragments, or other property of polynucleotides. In some instances, the evaluation is empirical or predicted from prior experiments and/or computer algorithms. An exemplary application includes increasing sequences in a probe library which correspond to areas of a genomic target having less than average read depth.

Selected sequences in a polynucleotide library can be at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or more than 95% of the sequences. In some instances, selected sequences in a polynucleotide library are at most 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or at most 100% of the sequences. In some cases, selected sequences are in a range of about 5-95%, 10-90%, 30-80%, 40-75%, or 50-70% of the sequences.

Polynucleotide libraries can be adjusted for the frequency of eachselected sequence. In some instances, polynucleotide libraries favor ahigher number of selected sequences. For example, a library is designedwhere increased polynucleotide frequency of selected sequences is in arange of about 40% to about 90%. In some instances, polynucleotidelibraries contain a low number of selected sequences. For example, alibrary is designed where increased polynucleotide frequency of theselected sequences is in a range of about 10% to about 60%. A librarycan be designed to favor a higher and lower frequency of selectedsequences. In some instances, a library favors uniform sequencerepresentation. For example, polynucleotide frequency is uniform withregard to selected sequence frequency, in a range of about 10% to about90%. In some instances, a library comprises polynucleotides with aselected sequence frequency of about 10% to about 95% of the sequences.

Generation of polynucleotide libraries with a specified selected sequence frequency in some cases occurs by combining at least 2 polynucleotide libraries with different selected sequence frequency content. In some instances, at least 2, 3, 4, 5, 6, 7, 10, or more than 10 polynucleotide libraries are combined to generate a population of polynucleotides with a specified selected sequence frequency. In some cases, no more than 2, 3, 4, 5, 6, 7, or 10 polynucleotide libraries are combined to generate a population of non-identical polynucleotides with a specified selected sequence frequency.
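By way of non-limiting illustration, combining two libraries to reach a specified selected sequence frequency can be sketched as a simple linear mixing calculation. The sketch below assumes that frequencies are expressed as fractions of molecules and that the libraries mix proportionally by molecule count; the function name and the example values are hypothetical and are not taken from the specification.

```python
def mixing_fraction(freq_a: float, freq_b: float, target: float) -> float:
    """Return the fraction of library A (by molecule count) to mix with
    library B so the pool has the target selected-sequence frequency.

    Assumes simple linear mixing: target = w * freq_a + (1 - w) * freq_b.
    """
    if freq_a == freq_b:
        raise ValueError("libraries must differ in selected sequence frequency")
    w = (target - freq_b) / (freq_a - freq_b)
    if not 0.0 <= w <= 1.0:
        raise ValueError("target is not reachable by mixing these two libraries")
    return w


# Hypothetical example: library A has 90% selected sequences, library B has 10%.
w = mixing_fraction(0.90, 0.10, 0.50)
print(f"fraction of library A in the pool: {w:.2f}")
```

A target frequency outside the interval spanned by the two input libraries cannot be reached by mixing alone, which is why the sketch raises an error in that case.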

In some instances, selected sequence frequency is adjusted bysynthesizing fewer or more polynucleotides per cluster. For example, atleast 25, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, or morethan 1000 non-identical polynucleotides are synthesized on a singlecluster. In some cases, no more than about 50, 100, 200, 300, 400, 500,600, 700, 800, 900, 1000 non-identical polynucleotides are synthesizedon a single cluster. In some instances, 50 to 500 non-identicalpolynucleotides are synthesized on a single cluster. In some instances,100 to 200 non-identical polynucleotides are synthesized on a singlecluster. In some instances, about 100, about 120, about 125, about 130,about 150, about 175, or about 200 non-identical polynucleotides aresynthesized on a single cluster.

In some cases, selected sequence frequency is adjusted by synthesizing non-identical polynucleotides of varying length. For example, the length of each of the non-identical polynucleotides synthesized may be at least or about at least 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500, 2000 nucleotides, or more. The length of the non-identical polynucleotides synthesized may be at most or about at most 2000, 500, 400, 300, 200, 150, 100, 50, 45, 35, 30, 25, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10 nucleotides, or less. The length of each of the non-identical polynucleotides synthesized may fall within a range of 10-2000, 10-500, 9-400, 11-300, 12-200, 13-150, 14-100, 15-50, 16-45, 17-40, 18-35, or 19-25 nucleotides.

Polynucleotide Probe Structures

Libraries of polynucleotide probes can be used to enrich particular target sequences in a larger population of sample polynucleotides. In some instances, polynucleotide probes each comprise a target binding sequence complementary to one or more target sequences, one or more non-target binding sequences, and one or more primer binding sites, such as universal primer binding sites. Target binding sequences that are complementary or at least partially complementary in some instances bind (hybridize) to target sequences. Primer binding sites, such as universal primer binding sites, facilitate simultaneous amplification of all members of the probe library, or a subpopulation of members. In some instances, the probes or adapters further comprise a barcode or index sequence. Barcodes are nucleic acid sequences that allow some feature of a polynucleotide with which the barcode is associated to be identified. After sequencing, the barcode region provides an indicator for identifying a characteristic associated with the coding region or sample source. Barcodes can be designed at suitable lengths to allow sufficient degree of identification, e.g., at least about 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, or more bases in length. Multiple barcodes, such as about 2, 3, 4, 5, 6, 7, 8, 9, 10, or more barcodes, may be used on the same molecule, optionally separated by non-barcode sequences. In some instances, each barcode in a plurality of barcodes differs from every other barcode in the plurality at three or more base positions, such as at least about 3, 4, 5, 6, 7, 8, 9, 10, or more positions. Use of barcodes allows for the pooling and simultaneous processing of multiple libraries for downstream applications, such as sequencing (multiplex).
In some instances, at least 4, 8, 16, 32, 48, 64, 128, 512, 1024, 2000, 5000, or more than 5000 barcoded libraries are used. In some instances, the polynucleotides are ligated to one or more molecular (or affinity) tags such as a small molecule, peptide, antigen, metal, or protein to form a probe for subsequent capture of the target sequences of interest. In some instances, only a portion of the polynucleotides are ligated to a molecular tag. In some instances, two probes that possess complementary target binding sequences which are capable of hybridization form a double stranded probe pair. Polynucleotide probes or adapters may comprise unique molecular identifiers (UMI). UMIs allow for internal measurement of initial sample concentrations or stoichiometry prior to downstream sample processing (e.g., PCR or enrichment steps) which can introduce bias. In some instances, UMIs comprise one or more barcode sequences.
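By way of non-limiting illustration, the requirement that each barcode differ from every other barcode at a minimum number of base positions can be checked with a pairwise Hamming distance calculation. The sketch below assumes equal-length barcodes; the function names and the four example barcodes are hypothetical and are not sequences from the specification.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: list[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical 6-base barcode set used only to exercise the check.
example = ["AAAAAA", "CCCAAA", "AAACCC", "CCCCCC"]
print(min_pairwise_distance(example) >= 3)
```

A set whose minimum pairwise distance is at least three can tolerate a single base error in a barcode read without that read matching a different barcode in the set.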

Probes described here may be complementary to target sequences which aresequences in a genome. Probes described here may be complementary totarget sequences which are exome sequences in a genome. Probes describedhere may be complementary to target sequences which are intron sequencesin a genome. In some instances, probes comprise a target bindingsequence complementary to a target sequence (of the sample nucleicacid), and at least one non-target binding sequence that is notcomplementary to the target. In some instances, the target bindingsequence of the probe is about 120 nucleotides in length, or at least10, 15, 20, 25, 50, 75, 100, 110, 120, 125, 140, 150, 160, 175, 200,300, 400, 500, or more than 500 nucleotides in length. The targetbinding sequence is in some instances no more than 10, 15, 20, 25, 50,75, 100, 125, 150, 175, 200, or no more than 500 nucleotides in length.The target binding sequence of the probe is in some instances about 120nucleotides in length, or about 10, 15, 20, 25, 40, 50, 60, 70, 80, 85,87, 90, 95, 97, 100, 105, 110, 115, 117, 118, 119, 120, 121, 122, 123,124, 125, 126, 127, 128, 129, 130, 135, 140, 145, 150, 155, 157, 158,159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 175, 180,190, 200, 210, 220, 230, 240, 250, 300, 400, or about 500 nucleotides inlength. The target binding sequence is in some instances about 20 toabout 400 nucleotides in length, or about 30 to about 175, about 40 toabout 160, about 50 to about 150, about 75 to about 130, about 90 toabout 120, or about 100 to about 140 nucleotides in length. Thenon-target binding sequence(s) of the probe is in some instances atleast about 20 nucleotides in length, or at least about 1, 5, 10, 15,17, 20, 23, 25, 50, 75, 100, 110, 120, 125, 140, 150, 160, 175, or morethan about 175 nucleotides in length. The non-target binding sequenceoften is no more than about 5, 10, 15, 20, 25, 50, 75, 100, 125, 150,175, or no more than about 200 nucleotides in length. 
The non-targetbinding sequence of the probe often is about 20 nucleotides in length,or about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,19, 20, 21, 22, 23, 25, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,150, or about 200 nucleotides in length. The non-target binding sequencein some instances is about 1 to about 250 nucleotides in length, orabout 20 to about 200, about 10 to about 100, about 10 to about 50,about 30 to about 100, about 5 to about 40, or about 15 to about 35nucleotides in length. The non-target binding sequence often comprisessequences that are not complementary to the target sequence, and/orcomprise sequences that are not used to bind primers. In some instances,the non-target binding sequence comprises a repeat of a singlenucleotide, for example polyadenine or polythymidine. A probe oftencomprises none or at least one non-target binding sequence. In someinstances, a probe comprises one or two non-target binding sequences.The non-target binding sequence may be adjacent to one or more targetbinding sequences in a probe. For example, a non-target binding sequenceis located on the 5′ or 3′ end of the probe. In some instances, thenon-target binding sequence is attached to a molecular tag or spacer.

In some instances, the non-target binding sequence(s) may be a primer binding site. The primer binding sites often are each at least about 20 nucleotides in length, or at least about 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, or at least about 40 nucleotides in length. Each primer binding site in some instances is no more than about 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, or no more than about 40 nucleotides in length. Each primer binding site in some instances is about 10 to about 50 nucleotides in length, or about 15 to about 40, about 20 to about 30, about 10 to about 40, about 10 to about 30, about 30 to about 50, or about 20 to about 60 nucleotides in length. In some instances, the polynucleotide probes comprise at least two primer binding sites. In some instances, primer binding sites may be universal primer binding sites, wherein all probes comprise identical primer binding sequences at these sites. In some instances, a pair of polynucleotide probes targeting a particular sequence and its reverse complement (e.g., a region of genomic DNA) comprises a first target binding sequence, a second target binding sequence, a first non-target binding sequence, and a second non-target binding sequence. For example, a pair of polynucleotide probes is complementary to a particular sequence (e.g., a region of genomic DNA).

In some instances, the first target binding sequence is the reverse complement of the second target binding sequence. In some instances, both target binding sequences are chemically synthesized prior to amplification. In an alternative arrangement, a pair of polynucleotide probes targeting a particular sequence and its reverse complement (e.g., a region of genomic DNA) comprise a first target binding sequence, a second target binding sequence, a first non-target binding sequence, a second non-target binding sequence, a third non-target binding sequence, and a fourth non-target binding sequence. In some instances, the first target binding sequence is the reverse complement of the second target binding sequence. In some instances, one or more non-target binding sequences comprise polyadenine or polythymidine.

In some instances, both probes in the pair are labeled with at least onemolecular tag. In some instances, PCR is used to introduce moleculartags (via primers comprising the molecular tag) onto the probes duringamplification. In some instances, the molecular tag comprises one ormore biotin, folate, a polyhistidine, a FLAG tag, glutathione, or othermolecular tag consistent with the specification. In some instancesprobes are labeled at the 5′ terminus. In some instances, the probes arelabeled at the 3′ terminus. In some instances, both the 5′ and 3′termini are labeled with a molecular tag. In some instances, the 5′terminus of a first probe in a pair is labeled with at least onemolecular tag, and the 3′ terminus of a second probe in the pair islabeled with at least one molecular tag. In some instances, a spacer ispresent between one or more molecular tags and the nucleic acids of theprobe. In some instances, the spacer may comprise an alkyl, polyol, orpolyamino chain, a peptide, or a polynucleotide. The solid support usedto capture probe-target nucleic acid complexes in some instances, is abead or a surface. The solid support in some instances comprises glass,plastic, or other material capable of comprising a capture moiety thatwill bind the molecular tag. In some instances, a bead is a magneticbead. For example, probes labeled with biotin are captured with amagnetic bead comprising streptavidin. The probes are contacted with alibrary of nucleic acids to allow binding of the probes to targetsequences. In some instances, blocking polynucleic acids are added toprevent binding of the probes to one or more adapter sequences attachedto the target nucleic acids. In some instances, blocking polynucleicacids comprise one or more nucleic acid analogues. In some instances,blocking polynucleic acids have a uracil substituted for thymine at oneor more positions.

Probes described herein may comprise complementary target binding sequences which bind to one or more target nucleic acid sequences. In some instances, the target sequences are any DNA or RNA nucleic acid sequence. In some instances, target sequences may be longer than the probe insert. In some instances, target sequences may be shorter than the probe insert. In some instances, target sequences may be the same length as the probe insert. For example, the length of the target sequence may be at least or about at least 2, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500, 1,000, 2,000, 5,000, 12,000, 20,000 nucleotides, or more. The length of the target sequence may be at most or about at most 20,000, 12,000, 5,000, 2,000, 1,000, 500, 400, 300, 200, 150, 100, 50, 45, 35, 30, 25, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 2 nucleotides, or less. The length of the target sequence may fall within a range of 2-20,000, 3-12,000, 5-5,000, 10-2,000, 10-1,000, 10-500, 9-400, 11-300, 12-200, 13-150, 14-100, 15-50, 16-45, 17-40, 18-35, or 19-25 nucleotides. The probe sequences may target sequences associated with specific genes, diseases, regulatory pathways, or other biological functions consistent with the specification.

In some instances, a single probe insert is complementary to one or more target sequences in a larger polynucleic acid (e.g., sample nucleic acid). An exemplary target sequence is an exon. In some instances, one or more probes target a single target sequence. In some instances, a single probe may target more than one target sequence. In some instances, the target binding sequence of the probe targets both a target sequence and an adjacent sequence. In some instances, a first probe targets a first region and a second region of a target sequence, and a second probe targets the second region and a third region of the target sequence. In some instances, a plurality of probes targets a single target sequence, wherein the target binding sequences of the plurality of probes contain one or more sequences which overlap with regard to complementarity to a region of the target sequence. In some instances, probe inserts do not overlap with regard to complementarity to a region of the target sequence. In some instances, at least 2, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500, 1000, 2000, 5,000, 12,000, 20,000, or more than 20,000 probes target a single target sequence. In some instances, no more than 4 probes directed to a single target sequence overlap, or no more than 3, 2, 1, or no probes targeting a single target sequence overlap. In some instances, one or more probes do not target all bases in a target sequence, leaving one or more gaps. In some instances, the gaps are near the middle of the target sequence. In some instances, the gaps are at the 5′ or 3′ ends of the target sequence. In some instances, the gaps are 6 nucleotides in length. In some instances, the gaps are no more than 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, or no more than 50 nucleotides in length. In some instances, the gaps are at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, or at least 50 nucleotides in length.
In some instances, the gaplength falls within 1-50, 1-40, 1-30, 1-20, 1-10, 2-30, 2-20, 2-10,3-50, 3-25, 3-10, or 3-8 nucleotides in length. In some instances, a setof probes targeting a sequence do not comprise overlapping regionsamongst probes in the set when hybridized to complementary sequence. Insome instances, a set of probes targeting a sequence do not have anygaps amongst probes in the set when hybridized to complementarysequence. Probes may be designed to maximize uniform binding to targetsequences. In some instances, probes are designed to minimize targetbinding sequences of high or low GC content, secondary structure,repetitive/palindromic sequences, or other sequence feature that mayinterfere with probe binding to a target. In some instances, a singleprobe may target a plurality of target sequences.
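By way of non-limiting illustration, gaps and overlaps between adjacent probes hybridized along a target sequence can be computed from probe coordinates. The sketch below assumes that each probe is represented as a half-open interval (start, end) on the target; the function name and the example coordinates are hypothetical.

```python
def adjacent_spacing(probes: list[tuple[int, int]]) -> list[int]:
    """Spacing between consecutive probes along a target.

    Each probe is a half-open interval (start, end). A positive value is a
    gap in nucleotides, zero means the probes abut, and a negative value is
    the length of an overlap.
    """
    ordered = sorted(probes)
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]


# Hypothetical layout of three 120-nucleotide probes on one target.
layout = [(0, 120), (126, 246), (240, 360)]
print(adjacent_spacing(layout))  # [6, -6]: a 6 nt gap, then a 6 nt overlap
```

A probe set with neither gaps nor overlaps yields a list consisting only of zeros.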

A probe library described herein may comprise at least 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, 100,000, 200,000, 500,000, 1,000,000 or more than 1,000,000 probes. A probe library may have no more than 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, 100,000, 200,000, 500,000, or no more than 1,000,000 probes. A probe library may comprise 10 to 500, 20 to 1000, 50 to 2000, 100 to 5000, 500 to 10,000, 1,000 to 5,000, 10,000 to 50,000, 100,000 to 500,000, or 50,000 to 1,000,000 probes. A probe library may comprise about 370,000; 400,000; 500,000 or more different probes.

Next Generation Sequencing Applications

Downstream applications of polynucleotide libraries may include next generation sequencing. For example, enrichment of target sequences with a controlled stoichiometry polynucleotide probe library results in more efficient sequencing. The performance of a polynucleotide library for capturing or hybridizing to targets may be defined by a number of different metrics describing efficiency, accuracy, and precision. For example, Picard metrics comprise variables such as HS library size (the number of unique molecules in the library that correspond to target regions, calculated from read pairs), mean target coverage (the percentage of bases reaching a specific coverage level), depth of coverage (number of reads including a given nucleotide), fold enrichment (sequence reads mapping uniquely to the target/reads mapping to the total sample, multiplied by the total sample length/target length), percent off-bait bases (percent of bases not corresponding to bases of the probes/baits), percent off-target (percent of bases not corresponding to bases of interest), usable bases on target, AT or GC dropout rate, fold 80 base penalty (fold over-coverage needed to raise 80 percent of non-zero targets to the mean coverage level), percent zero coverage targets, PF reads (the number of reads passing a quality filter), percent selected bases (the sum of on-bait bases and near-bait bases divided by the total aligned bases), percent duplication, or other variable consistent with the specification.
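By way of non-limiting illustration, two of the metrics listed above can be sketched as follows. Fold enrichment follows the ratio stated in the text. The fold 80 base penalty is computed here as the mean coverage divided by the 20th percentile coverage of non-zero targets, using a nearest-rank percentile; this is one common way to compute the metric and is an assumption of the sketch, not a statement of how any particular tool implements it. The function names and example values are hypothetical.

```python
def fold_enrichment(on_target_reads: int, total_reads: int,
                    sample_length: int, target_length: int) -> float:
    """(on-target reads / total reads) * (sample length / target length)."""
    return (on_target_reads / total_reads) * (sample_length / target_length)


def fold_80_penalty(target_coverages: list[float]) -> float:
    """Mean coverage divided by the 20th percentile coverage (nearest rank),
    computed over targets with non-zero coverage only."""
    nonzero = sorted(c for c in target_coverages if c > 0)
    if not nonzero:
        raise ValueError("no targets with non-zero coverage")
    mean = sum(nonzero) / len(nonzero)
    # Nearest-rank 20th percentile using integer arithmetic: ceil(0.2 * n).
    rank = (len(nonzero) * 20 + 99) // 100
    return mean / nonzero[rank - 1]


# Hypothetical values used only to exercise the functions.
print(fold_enrichment(500_000, 1_000_000, 3_000_000_000, 30_000_000))
print(fold_80_penalty([0, 10, 20, 30, 40, 50]))
```

A fold 80 base penalty close to 1 indicates uniform coverage, since little additional sequencing would be needed to bring 80 percent of the targets up to the mean.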

Read depth (sequencing depth, or sampling) represents the total number of times a sequenced nucleic acid fragment (a “read”) is obtained for a sequence. Theoretical read depth is defined as the expected number of times the same nucleotide is read, assuming reads are perfectly distributed throughout an idealized genome. Read depth is expressed as a function of % coverage (or coverage breadth). For example, 10 million reads of a 1 million base genome, perfectly distributed, theoretically results in 10× read depth of 100% of the sequences. In practice, a greater number of reads (higher theoretical read depth, or oversampling) may be needed to obtain the desired read depth for a percentage of the target sequences. Enrichment of target sequences with a controlled stoichiometry probe library increases the efficiency of downstream sequencing, as fewer total reads will be required to obtain an outcome with an acceptable number of reads over a desired % of target sequences. For example, in some instances 55× theoretical read depth of target sequences results in at least 30× coverage of at least 90% of the sequences. In some instances, no more than 55× theoretical read depth of target sequences results in at least 30× read depth of at least 80% of the sequences. In some instances, no more than 55× theoretical read depth of target sequences results in at least 30× read depth of at least 95% of the sequences. In some instances, no more than 55× theoretical read depth of target sequences results in at least 10× read depth of at least 98% of the sequences. In some instances, 55× theoretical read depth of target sequences results in at least 20× read depth of at least 98% of the sequences. In some instances, no more than 55× theoretical read depth of target sequences results in at least 5× read depth of at least 98% of the sequences. Increasing the concentration of probes during hybridization with targets can lead to an increase in read depth.
Insome instances, the concentration of probes is increased by at least1.5×, 2.0×, 2.5×, 3×, 3.5×, 4×, 5×, or more than 5×. In some instances,increasing the probe concentration results in at least a 1000% increase,or a 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, 500%,750%, 1000%, or more than a 1000% increase in read depth. In someinstances, increasing the probe concentration by 3× results in a 1000%increase in read depth. In some instances, sequencing is performed toachieve a theoretical read depth of at least 30×, 50×, 100×, 150×, 200×,250×, 300×, 500×, or at least 1000×. In some instances, sequencing isperformed to achieve a theoretical read depth of about 30×, 50×, 100×,150×, 200×, 250×, 300×, 500×, or about 1000×. In some instances,sequencing is performed to achieve a theoretical read depth of no morethan 30×, 50×, 100×, 150×, 200×, 250×, 300×, 500×, or no more than1000×. In some instances, sequencing is performed to achieve an actualread depth of at least 30×, 50×, 100×, 150×, 200×, 250×, 300×, 500×, orat least 1000×. In some instances, sequencing is performed to achieve anactual read depth of no more than 30×, 50×, 100×, 150×, 200×, 250×,300×, 500×, or no more than 1000×. In some instances, sequencing isperformed to achieve an actual read depth of about 30×, 50×, 100×, 150×,200×, 250×, 300×, 500×, or about 1000×.
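By way of non-limiting illustration, the relationship between theoretical read depth and coverage breadth can be sketched as follows. The sketch assumes that theoretical depth equals total sequenced bases divided by the length of the sequenced region, so the 10 million read example given above corresponds to one base counted per read; the function names and the per-position depths are hypothetical.

```python
def theoretical_depth(n_reads: int, read_length: int, region_length: int) -> float:
    """Expected depth if reads were perfectly distributed over the region."""
    return n_reads * read_length / region_length


def breadth_at_depth(depths: list[int], min_depth: int) -> float:
    """Fraction of positions (or sequences) with depth >= min_depth."""
    return sum(d >= min_depth for d in depths) / len(depths)


# The 10x example from the text, counting one base per read.
print(theoretical_depth(10_000_000, 1, 1_000_000))

# Hypothetical observed depths for five target sequences.
observed = [55, 40, 31, 30, 12]
print(breadth_at_depth(observed, 30))
```

Comparing the theoretical depth with the breadth achieved at a required minimum depth gives a direct measure of how much oversampling a given library requires.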

On-target rate represents the percentage of sequencing reads that correspond with the desired target sequences. In some instances, a controlled stoichiometry polynucleotide probe library results in an on-target rate of at least 30%, or at least 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, or at least 90%. Increasing the concentration of polynucleotide probes during contact with target nucleic acids leads to an increase in the on-target rate. In some instances, the concentration of probes is increased by at least 1.5×, 2.0×, 2.5×, 3×, 3.5×, 4×, 5×, or more than 5×. In some instances, increasing the probe concentration results in at least a 20% increase, or a 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, or at least a 500% increase in on-target binding. In some instances, increasing the probe concentration by 3× results in a 20% increase in on-target rate.

Coverage uniformity is in some cases calculated as the read depth as a function of the target sequence identity. Higher coverage uniformity results in a lower number of sequencing reads needed to obtain the desired read depth. For example, a property of the target sequence may affect the read depth, for example, high or low GC or AT content, repeating sequences, trailing adenines, secondary structure, affinity for target sequence binding (for amplification, enrichment, or detection), stability, melting temperature, biological activity, ability to assemble into larger fragments, sequences containing modified nucleotides or nucleotide analogues, or any other property of polynucleotides. Enrichment of target sequences with controlled stoichiometry polynucleotide probe libraries results in higher coverage uniformity after sequencing. In some instances, 95% of the sequences have a read depth that is within 1× of the mean library read depth, or within about 0.05, 0.1, 0.2, 0.5, 0.7, 1, 1.2, 1.5, 1.7 or about 2× of the mean library read depth. In some instances, 80%, 85%, 90%, 95%, 97%, or 99% of the sequences have a read depth that is within 1× of the mean.
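By way of non-limiting illustration, the fraction of sequences whose read depth lies within a given multiple of the mean can be computed as sketched below. The sketch assumes one possible reading of "within k× of the mean", namely that the absolute difference between a sequence's depth and the mean is no more than k times the mean; other readings are possible. The function name and example depths are hypothetical.

```python
def fraction_within_fold_of_mean(depths: list[float], k: float) -> float:
    """Fraction of sequences with |depth - mean| <= k * mean.

    This interpretation of "within k x of the mean" is an assumption made
    for illustration only.
    """
    mean = sum(depths) / len(depths)
    return sum(abs(d - mean) <= k * mean for d in depths) / len(depths)


# Hypothetical per-sequence read depths (mean is 175).
depths = [50, 100, 150, 400]
print(fraction_within_fold_of_mean(depths, 1.0))
print(fraction_within_fold_of_mean(depths, 0.5))
```

Under this reading, a smaller k is a stricter uniformity criterion, so the reported fraction can only decrease as k is lowered.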

Enrichment of Target Nucleic Acids with a Polynucleotide Probe Library

A probe library described herein may be used to enrich target polynucleotides present in a population of sample polynucleotides, for a variety of downstream applications. In some instances, a sample is obtained from one or more sources, and the population of sample polynucleotides is isolated. Samples are obtained (by way of non-limiting example) from biological sources such as saliva, blood, tissue, skin, or completely synthetic sources. The plurality of polynucleotides obtained from the sample are fragmented, end-repaired, and adenylated to form a double stranded sample nucleic acid fragment. In some instances, end repair is accomplished by treatment with one or more enzymes, such as T4 DNA polymerase, Klenow enzyme, and T4 polynucleotide kinase in an appropriate buffer. A nucleotide overhang to facilitate ligation to adapters is added, in some instances with 3′ to 5′ exo minus Klenow fragment and dATP.

Adapters (such as universal adapters) may be ligated to both ends of thesample polynucleotide fragments with a ligase, such as T4 ligase, toproduce a library of adapter-tagged polynucleotide strands, and theadapter-tagged polynucleotide library is amplified with primers, such asuniversal primers. In some instances, the adapters are Y-shaped adapterscomprising one or more primer binding sites, one or more graftingregions, and one or more index (or barcode) regions. In some instances,the one or more index region is present on each strand of the adapter.In some instances, grafting regions are complementary to a flowcellsurface, and facilitate next generation sequencing of sample libraries.In some instances, Y-shaped adapters comprise partially complementarysequences. In some instances, Y-shaped adapters comprise a singlethymidine overhang which hybridizes to the overhanging adenine of thedouble stranded adapter-tagged polynucleotide strands. Y-shaped adaptersmay comprise modified nucleic acids, that are resistant to cleavage. Forexample, a phosphorothioate backbone is used to attach an overhangingthymidine to the 3′ end of the adapters. If universal primers are used,amplification of the library is performed to add barcoded primers to theadapters. In some instances, an enrichment workflow is depicted in FIG.5. A library 208 of double stranded adapter-tagged polynucleotidestrands 209 is contacted with polynucleotide probes 217, to form hybridpairs 218. Such pairs are separated 212 from unhybridized fragments, andisolated from probes to produce an enriched library 213. The enrichedlibrary may then be sequenced 214.

The library of double stranded sample nucleic acid fragments is thendenatured in the presence of adapter blockers. Adapter blockers minimizeoff-target hybridization of probes to the adapter sequences (instead oftarget sequences) present on the adapter-tagged polynucleotide strands,and/or prevent intermolecular hybridization of adapters (i.e., “daisychaining”). Denaturation is carried out in some instances at 96° C., orat about 85, 87, 90, 92, 95, 97, 98 or about 99° C. A polynucleotidetargeting library (probe library) is denatured in a hybridizationsolution, in some instances at 96° C., at about 85, 87, 90, 92, 95, 97,98 or 99° C. The denatured adapter-tagged polynucleotide library and thehybridization solution are incubated for a suitable amount of time andat a suitable temperature to allow the probes to hybridize with theircomplementary target sequences. In some instances, a suitablehybridization temperature is about 45 to 80° C., or at least 45, 50, 55,60, 65, 70, 75, 80, 85, or 90° C. In some instances, the hybridizationtemperature is 70° C. In some instances, a suitable hybridization timeis 16 hours, or at least 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, or morethan 22 hours, or about 12 to 20 hours. Binding buffer is then added tothe hybridized adapter-tagged-polynucleotide probes, and a solid supportcomprising a capture moiety is used to selectively bind the hybridizedadapter-tagged polynucleotide-probes. The solid support is washed withbuffer to remove unbound polynucleotides before an elution buffer isadded to release the enriched, tagged polynucleotide fragments from thesolid support. In some instances, the solid support is washed 2 times,or 1, 2, 3, 4, 5, or 6 times. The enriched library of adapter-taggedpolynucleotide fragments is amplified and the enriched library issequenced.

A plurality of nucleic acids (e.g., genomic sequence) may be obtained from a sample, and fragmented, optionally end-repaired, and adenylated. Adapters are ligated to both ends of the polynucleotide fragments to produce a library of adapter-tagged polynucleotide strands, and the adapter-tagged polynucleotide library is amplified. The adapter-tagged polynucleotide library is then denatured at high temperature, preferably 96° C., in the presence of adapter blockers. A polynucleotide targeting library (probe library) is denatured in a hybridization solution at high temperature, preferably about 90 to 99° C., and combined with the denatured, tagged polynucleotide library in hybridization solution for about 10 to 24 hours at about 45 to 80° C. Binding buffer is then added to the hybridized tagged polynucleotide probes, and a solid support comprising a capture moiety is used to selectively bind the hybridized adapter-tagged polynucleotide-probes. The solid support is washed one or more times with buffer, preferably between about 2 and 5 times, to remove unbound polynucleotides before an elution buffer is added to release the enriched, adapter-tagged polynucleotide fragments from the solid support. The enriched library of adapter-tagged polynucleotide fragments is amplified and then the library is sequenced. Alternative variables such as incubation times, temperatures, reaction volumes/concentrations, number of washes, or other variables consistent with the specification are also employed in the method.

In any of the instances, the detection or quantification analysis of the oligonucleotides can be accomplished by sequencing. The subunits or entire synthesized oligonucleotides can be detected via full sequencing of all oligonucleotides by any suitable methods known in the art, e.g., Illumina sequencing by synthesis, PacBio nanopore sequencing, or BGI/MGI nanoball sequencing, including the sequencing methods described herein.

Sequencing can be accomplished through classic Sanger sequencing methods which are well known in the art. Sequencing can also be accomplished using high-throughput systems, some of which allow detection of a sequenced nucleotide immediately after or upon its incorporation into a growing strand, i.e., detection of sequence in real time or substantially real time. In some cases, high throughput sequencing generates at least 1,000, at least 5,000, at least 10,000, at least 20,000, at least 30,000, at least 40,000, at least 50,000, at least 100,000 or at least 500,000 sequence reads per hour, with each read being at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 120 or at least 150 bases per read.
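The reads-per-hour and bases-per-read figures above combine into a base throughput, and, for a given target size, into the number of reads needed to reach a desired depth. The following is a minimal sketch of that arithmetic. The function names and the 50 kb target size are hypothetical, and the calculation assumes every read maps uniformly onto the target, which real capture data does not achieve.

```python
import math


def bases_per_hour(reads_per_hour: int, bases_per_read: int) -> int:
    """Raw base throughput from read count and read length."""
    return reads_per_hour * bases_per_read


def reads_for_depth(target_bp: int, depth: float, bases_per_read: int) -> int:
    """Reads needed for a mean depth over a target, assuming uniform coverage."""
    return math.ceil(target_bp * depth / bases_per_read)


# 500,000 reads per hour at 150 bases per read (upper figures recited above).
throughput = bases_per_hour(500_000, 150)

# Hypothetical 50 kb target sequenced to a 10,000x mean depth with 150-base reads.
needed = reads_for_depth(target_bp=50_000, depth=10_000, bases_per_read=150)
```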

In some instances, high-throughput sequencing involves the use of technology available from Illumina, such as the Genome Analyzer IIx, MiSeq personal sequencer, or HiSeq systems, such as those using HiSeq 2500, HiSeq 1500, HiSeq 2000, HiSeq 1000, iSeq 100, MiniSeq, MiSeq, NextSeq 550, NextSeq 2000, or NovaSeq 6000. These machines use reversible terminator-based sequencing by synthesis chemistry. These machines can generate 6000 Gb or more of sequence data in 13-44 hours. Smaller systems may be utilized for runs within 3, 2, 1 days or less time. Short synthesis cycles may be used to minimize the time it takes to obtain sequencing results.

In some instances, high-throughput sequencing involves the use of technology available by ABI Solid System. This genetic analysis platform enables massively parallel sequencing of clonally-amplified DNA fragments linked to beads. The sequencing methodology is based on sequential ligation with dye-labeled oligonucleotides.

The next generation sequencing can comprise ion semiconductor sequencing (e.g., using technology from Life Technologies (Ion Torrent)). Ion semiconductor sequencing can take advantage of the fact that when a nucleotide is incorporated into a strand of DNA, an ion can be released. To perform ion semiconductor sequencing, a high density array of micromachined wells can be formed. Each well can hold a single DNA template. Beneath the well can be an ion sensitive layer, and beneath the ion sensitive layer can be an ion sensor. When a nucleotide is added to a DNA, H+ can be released, which can be measured as a change in pH. The H+ ion can be converted to voltage and recorded by the semiconductor sensor. An array chip can be sequentially flooded with one nucleotide after another. No scanning, light, or cameras can be required. In some cases, an IONPROTON™ Sequencer is used to sequence nucleic acid. In some cases, an IONPGM™ Sequencer is used. The Ion Torrent Personal Genome Machine (PGM) can do 10 million reads in two hours.
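Because the chip is flooded with one nucleotide at a time, each flow yields a signal that reflects how many of that nucleotide were incorporated, and a read can be reconstructed from the sequence of flow signals. The following toy decoder illustrates the idea. The flow order `TACG`, the normalization of signals so that 1.0 corresponds to a single incorporation, and the function name are all assumptions for illustration, not the behavior of any vendor's base caller.

```python
def decode_flows(signals, flow_order="TACG"):
    """Convert per-flow signal values into a base sequence.

    Each signal is rounded to the nearest whole number of incorporations;
    a value near 0 means the flowed nucleotide was not incorporated.
    """
    bases = []
    for index, signal in enumerate(signals):
        nucleotide = flow_order[index % len(flow_order)]
        count = max(int(round(signal)), 0)  # clamp noise below zero
        bases.append(nucleotide * count)
    return "".join(bases)


# Flows: T, A, C, G, T, A with noisy, normalized signals.
read = decode_flows([1.02, 0.05, 1.96, 0.10, 0.00, 1.10])
```

In this sketch a homopolymer such as `CC` appears as a single flow with a signal near 2, which is why signal noise in flow-based methods tends to show up as homopolymer length errors.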

In some instances, high-throughput sequencing involves the use of technology available by Helicos BioSciences Corporation (Cambridge, Mass.) such as the Single Molecule Sequencing by Synthesis (SMSS) method. SMSS is unique because it allows for sequencing the entire human genome in up to 24 hours. Finally, SMSS is powerful because, like the MW technology, it does not require a pre-amplification step prior to hybridization. In fact, SMSS does not require any amplification.

In some instances, high-throughput sequencing involves the use of technology available by 454 Life Sciences, Inc. (Branford, Conn.) such as the PicoTiterPlate device, which includes a fiber optic plate that transmits chemiluminescent signal generated by the sequencing reaction to be recorded by a CCD camera in the instrument. This use of fiber optics allows for the detection of a minimum of 20 million base pairs in 4.5 hours.

Methods for using bead amplification followed by fiber optics detection are described in Margulies, M., et al. “Genome sequencing in microfabricated high-density picolitre reactors”, Nature, doi: 10.1038/nature03959.

In some instances, high-throughput sequencing is performed using Clonal Single Molecule Array (Solexa, Inc.) or sequencing-by-synthesis (SBS) utilizing reversible terminator chemistry. Constans, A., The Scientist 2003, 17(13):36. High-throughput sequencing of oligonucleotides can be achieved using any suitable sequencing method known in the art, such as those commercialized by Pacific Biosciences, Complete Genomics, Genia Technologies, Halcyon Molecular, Oxford Nanopore Technologies and the like. Overall such systems involve sequencing a target oligonucleotide molecule having a plurality of bases by the temporal addition of bases via a polymerization reaction that is measured on a molecule of oligonucleotide, i.e., the activity of a nucleic acid polymerizing enzyme on the template oligonucleotide molecule to be sequenced is followed in real time. Sequence can then be deduced by identifying which base is being incorporated into the growing complementary strand of the target oligonucleotide by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target oligonucleotide molecule complex is provided in a position suitable to move along the target oligonucleotide molecule and extend the oligonucleotide primer at an active site. A plurality of labeled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target oligonucleotide sequence. The growing oligonucleotide strand is extended by using the polymerase to add a nucleotide analog to the oligonucleotide strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target oligonucleotide at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labeled nucleotide analogs, polymerizing the growing oligonucleotide strand, and identifying the added nucleotide analog are repeated so that the oligonucleotide strand is further extended and the sequence of the target oligonucleotide is determined.

The next generation sequencing technique can comprise single molecule real-time (SMRT™) technology by Pacific Biosciences. In SMRT, each of four DNA bases can be attached to one of four different fluorescent dyes. These dyes can be phospholinked. A single DNA polymerase can be immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW). A ZMW can be a confinement structure which enables observation of incorporation of a single nucleotide by DNA polymerase against the background of fluorescent nucleotides that can rapidly diffuse in and out of the ZMW (in microseconds). It can take several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label can be excited and produce a fluorescent signal, and the fluorescent tag can be cleaved off. The ZMW can be illuminated from below. Attenuated light from an excitation beam can penetrate the lower 20-30 nm of each ZMW. A microscope with a detection limit of 20 zeptoliters (20 × 10⁻²¹ liters) can be created. The tiny detection volume can provide 1000-fold improvement in the reduction of background noise. Detection of the corresponding fluorescence of the dye can indicate which base was incorporated. The process can be repeated.

In some cases, the next generation sequencing is nanopore sequencing (see, e.g., Soni G V and Meller A. (2007) Clin Chem 53: 1996-2001). A nanopore can be a small hole, of the order of about one nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it can result in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows can be sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule can obstruct the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through the nanopore can represent a reading of the DNA sequence. The nanopore sequencing technology can be from Oxford Nanopore Technologies; e.g., a GridION system. A single nanopore can be inserted in a polymer membrane across the top of a microwell. Each microwell can have an electrode for individual sensing. The microwells can be fabricated into an array chip, with 100,000 or more microwells (e.g., more than 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, or 1,000,000) per chip. An instrument (or node) can be used to analyze the chip. Data can be analyzed in real-time. One or more instruments can be operated at a time. The nanopore can be a protein nanopore, e.g., the protein alpha-hemolysin, a heptameric protein pore. The nanopore can be a solid-state nanopore, e.g., a nanometer sized hole formed in a synthetic membrane (e.g., SiN_(x), or SiO₂). The nanopore can be a hybrid pore (e.g., an integration of a protein pore into a solid-state membrane). The nanopore can be a nanopore with integrated sensors (e.g., tunneling electrode detectors, capacitive detectors, or graphene based nano-gap or edge state detectors (see e.g., Garaj et al. (2010) Nature vol. 467, doi: 10.1038/nature09379)). A nanopore can be functionalized for analyzing a specific type of molecule (e.g., DNA, RNA, or protein). Nanopore sequencing can comprise “strand sequencing” in which intact DNA polymers can be passed through a protein nanopore with sequencing in real time as the DNA translocates the pore. An enzyme can separate strands of a double stranded DNA and feed a strand through a nanopore. The DNA can have a hairpin at one end, and the system can read both strands. In some cases, nanopore sequencing is “exonuclease sequencing” in which individual nucleotides can be cleaved from a DNA strand by a processive exonuclease, and the nucleotides can be passed through a protein nanopore. The nucleotides can transiently bind to a molecule in the pore (e.g., cyclodextrin). A characteristic disruption in current can be used to identify bases.

Nanopore sequencing technology from GENIA can be used. An engineered protein pore can be embedded in a lipid bilayer membrane. “Active Control” technology can be used to enable efficient nanopore-membrane assembly and control of DNA movement through the channel. In some cases, the nanopore sequencing technology is from NABsys. Genomic DNA can be fragmented into strands of average length of about 100 kb. The 100 kb fragments can be made single stranded and subsequently hybridized with a 6-mer probe. The genomic fragments with probes can be driven through a nanopore, which can create a current-versus-time tracing. The current tracing can provide the positions of the probes on each genomic fragment. The genomic fragments can be lined up to create a probe map for the genome. The process can be done in parallel for a library of probes. A genome-length probe map for each probe can be generated. Errors can be fixed with a process termed “moving window Sequencing By Hybridization (mwSBH).” In some cases, the nanopore sequencing technology is from IBM/Roche. An electron beam can be used to make a nanopore sized opening in a microchip. An electrical field can be used to pull or thread DNA through the nanopore. A DNA transistor device in the nanopore can comprise alternating nanometer sized layers of metal and dielectric. Discrete charges in the DNA backbone can get trapped by electrical fields inside the DNA nanopore. Turning off and on gate voltages can allow the DNA sequence to be read.

The next generation sequencing can comprise DNA nanoball sequencing (as performed, e.g., by Complete Genomics; see e.g., Drmanac et al. (2010) Science 327: 78-81). DNA can be isolated, fragmented, and size selected. For example, DNA can be fragmented (e.g., by sonication) to a mean length of about 500 bp. Adaptors (Ad1) can be attached to the ends of the fragments. The adaptors can be used to hybridize to anchors for sequencing reactions. DNA with adaptors bound to each end can be PCR amplified. The adaptor sequences can be modified so that complementary single strand ends bind to each other forming circular DNA. The DNA can be methylated to protect it from cleavage by a type IIS restriction enzyme used in a subsequent step. An adaptor (e.g., the right adaptor) can have a restriction recognition site, and the restriction recognition site can remain non-methylated. The non-methylated restriction recognition site in the adaptor can be recognized by a restriction enzyme (e.g., AcuI), and the DNA can be cleaved by AcuI 13 bp to the right of the right adaptor to form linear double stranded DNA. A second round of right and left adaptors (Ad2) can be ligated onto either end of the linear DNA, and all DNA with both adaptors bound can be amplified (e.g., by PCR). Ad2 sequences can be modified to allow them to bind each other and form circular DNA. The DNA can be methylated, but a restriction enzyme recognition site can remain non-methylated on the left Ad1 adaptor. A restriction enzyme (e.g., AcuI) can be applied, and the DNA can be cleaved 13 bp to the left of the Ad1 to form a linear DNA fragment. A third round of right and left adaptors (Ad3) can be ligated to the right and left flank of the linear DNA, and the resulting fragment can be PCR amplified. The adaptors can be modified so that they can bind to each other and form circular DNA. A type III restriction enzyme (e.g., EcoP15) can be added; EcoP15 can cleave the DNA 26 bp to the left of Ad3 and 26 bp to the right of Ad2. This cleavage can remove a large segment of DNA and linearize the DNA once again. A fourth round of right and left adaptors (Ad4) can be ligated to the DNA, the DNA can be amplified (e.g., by PCR), and the adaptors modified so that they bind each other and form the completed circular DNA template.

Rolling circle replication (e.g., using Phi 29 DNA polymerase) can be used to amplify small fragments of DNA. The four adaptor sequences can contain palindromic sequences that can hybridize and a single strand can fold onto itself to form a DNA nanoball (DNB™) which can be approximately 200-300 nanometers in diameter on average. A DNA nanoball can be attached (e.g., by adsorption) to a microarray (sequencing flow cell). The flow cell can be a silicon wafer coated with silicon dioxide, titanium and hexamethyldisilazane (HMDS) and a photoresist material. Sequencing can be performed by unchained sequencing by ligating fluorescent probes to the DNA. The color of the fluorescence of an interrogated position can be visualized by a high resolution camera. The identity of nucleotide sequences between adaptor sequences can be determined.

A population of polynucleotides may be enriched prior to adapter ligation. In one example, a plurality of polynucleotides is obtained from a sample, fragmented, optionally end-repaired, and denatured at high temperature, preferably 90-99° C. A polynucleotide targeting library (probe library) is denatured in a hybridization solution at high temperature, preferably about 90 to 99° C., and combined with the denatured polynucleotide library in hybridization solution for about 10 to 24 hours at about 45 to 80° C. Binding buffer is then added to the hybridized polynucleotide probes, and a solid support comprising a capture moiety is used to selectively bind the hybridized polynucleotide-probes. The solid support is washed one or more times with buffer, preferably about 2 to 5 times, to remove unbound polynucleotides before an elution buffer is added to release the enriched polynucleotide fragments from the solid support. The enriched polynucleotide fragments are then adenylated, adapters are ligated to both ends of the polynucleotide fragments to produce a library of adapter-tagged polynucleotide strands, and the adapter-tagged polynucleotide library is amplified. The adapter-tagged polynucleotide library is then sequenced.

A polynucleotide targeting library may also be used to filter undesired sequences from a plurality of polynucleotides, by hybridizing to undesired fragments. For example, a plurality of polynucleotides is obtained from a sample, and fragmented, optionally end-repaired, and adenylated. Adapters are ligated to both ends of the polynucleotide fragments to produce a library of adapter-tagged polynucleotide strands, and the adapter-tagged polynucleotide library is amplified. Alternatively, adenylation and adapter ligation steps are instead performed after enrichment of the sample polynucleotides. The adapter-tagged polynucleotide library is then denatured at high temperature, preferably 90-99° C., in the presence of adapter blockers. A polynucleotide filtering library (probe library) designed to remove undesired, non-target sequences is denatured in a hybridization solution at high temperature, preferably about 90 to 99° C., and combined with the denatured, tagged polynucleotide library in hybridization solution for about 10 to 24 hours at about 45 to 80° C. Binding buffer is then added to the hybridized tagged polynucleotide probes, and a solid support comprising a capture moiety is used to selectively bind the hybridized adapter-tagged polynucleotide-probes. The solid support is washed one or more times with buffer, preferably about 1 to 5 times, to elute unbound adapter-tagged polynucleotide fragments. The enriched library of unbound adapter-tagged polynucleotide fragments is amplified and then the amplified library is sequenced.

Highly Parallel De Novo Nucleic Acid Synthesis

Described herein is a platform approach utilizing miniaturization, parallelization, and vertical integration of the end-to-end process from polynucleotide synthesis to gene assembly within nanowells on silicon to create a revolutionary synthesis platform. Devices described herein provide, with the same footprint as a 96-well plate, a silicon synthesis platform capable of increasing throughput by a factor of 100 to 1,000 compared to traditional synthesis methods, with production of up to approximately 1,000,000 polynucleotides in a single highly-parallelized run. In some instances, a single silicon plate described herein provides for synthesis of about 6,100 non-identical polynucleotides. In some instances, each of the non-identical polynucleotides is located within a cluster. A cluster may comprise 50 to 500 non-identical polynucleotides.

Methods described herein provide for synthesis of a library of polynucleotides each encoding for a predetermined variant of at least one predetermined reference nucleic acid sequence. In some cases, the predetermined reference sequence is a nucleic acid sequence encoding for a protein, and the variant library comprises sequences encoding for variation of at least a single codon such that a plurality of different variants of a single residue in the subsequent protein encoded by the synthesized nucleic acid are generated by standard translation processes. The synthesized specific alterations in the nucleic acid sequence can be introduced by incorporating nucleotide changes into overlapping or blunt ended polynucleotide primers. Alternatively, a population of polynucleotides may collectively encode for a long nucleic acid (e.g., a gene) and variants thereof. In this arrangement, the population of polynucleotides can be hybridized and subject to standard molecular biology techniques to form the long nucleic acid (e.g., a gene) and variants thereof. When the long nucleic acid (e.g., a gene) and variants thereof are expressed in cells, a variant protein library is generated. Similarly, provided here are methods for synthesis of variant libraries encoding for RNA sequences (e.g., miRNA, shRNA, and mRNA) or DNA sequences (e.g., enhancer, promoter, UTR, and terminator regions). Also provided here are downstream applications for variants selected out of the libraries synthesized using methods described here. Downstream applications include identification of variant nucleic acid or protein sequences with enhanced biologically relevant functions, e.g., biochemical affinity, enzymatic activity, changes in cellular activity, and for the treatment or prevention of a disease state.

Substrates

Provided herein are substrates comprising a plurality of clusters, wherein each cluster comprises a plurality of loci that support the attachment and synthesis of polynucleotides. The term “locus” as used herein refers to a discrete region on a structure which provides support for polynucleotides encoding for a single predetermined sequence to extend from the surface. In some instances, a locus is on a two dimensional surface, e.g., a substantially planar surface. In some instances, a locus refers to a discrete raised or lowered site on a surface, e.g., a well, microwell, channel, or post. In some instances, a surface of a locus comprises a material that is actively functionalized to attach to at least one nucleotide for polynucleotide synthesis, or preferably, a population of identical nucleotides for synthesis of a population of polynucleotides. In some instances, polynucleotide refers to a population of polynucleotides encoding for the same nucleic acid sequence. In some instances, a surface of a device is inclusive of one or a plurality of surfaces of a substrate.

Provided herein are structures that may comprise a surface that supports the synthesis of a plurality of polynucleotides having different predetermined sequences at addressable locations on a common support. In some instances, a device provides support for the synthesis of more than 2,000; 5,000; 10,000; 20,000; 30,000; 50,000; 75,000; 100,000; 200,000; 300,000; 400,000; 500,000; 600,000; 700,000; 800,000; 900,000; 1,000,000; 1,200,000; 1,400,000; 1,600,000; 1,800,000; 2,000,000; 2,500,000; 3,000,000; 3,500,000; 4,000,000; 4,500,000; 5,000,000; 10,000,000 or more non-identical polynucleotides. In some instances, the device provides support for the synthesis of more than 2,000; 5,000; 10,000; 20,000; 30,000; 50,000; 75,000; 100,000; 200,000; 300,000; 400,000; 500,000; 600,000; 700,000; 800,000; 900,000; 1,000,000; 1,200,000; 1,400,000; 1,600,000; 1,800,000; 2,000,000; 2,500,000; 3,000,000; 3,500,000; 4,000,000; 4,500,000; 5,000,000; 10,000,000 or more polynucleotides encoding for distinct sequences. In some instances, at least a portion of the polynucleotides have an identical sequence or are configured to be synthesized with an identical sequence.

Provided herein are methods and devices for manufacture and growth of polynucleotides about 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, or 2000 bases in length. In some instances, the length of the polynucleotide formed is about 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, or 225 bases in length. A polynucleotide may be at least 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 bases in length. A polynucleotide may be from 10 to 225 bases in length, from 12 to 100 bases in length, from 20 to 150 bases in length, from 20 to 130 bases in length, or from 30 to 100 bases in length.

In some instances, polynucleotides are synthesized on distinct loci of a substrate, wherein each locus supports the synthesis of a population of polynucleotides. In some instances, each locus supports the synthesis of a population of polynucleotides having a different sequence than a population of polynucleotides grown on another locus. In some instances, the loci of a device are located within a plurality of clusters. In some instances, a device comprises at least 10, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 20000, 30000, 40000, 50000 or more clusters. In some instances, a device comprises more than 2,000; 5,000; 10,000; 100,000; 200,000; 300,000; 400,000; 500,000; 600,000; 700,000; 800,000; 900,000; 1,000,000; 1,100,000; 1,200,000; 1,300,000; 1,400,000; 1,500,000; 1,600,000; 1,700,000; 1,800,000; 1,900,000; 2,000,000; 2,500,000; 3,000,000; 3,500,000; 4,000,000; 4,500,000; 5,000,000; or 10,000,000 or more distinct loci. In some instances, a device comprises about 10,000 distinct loci. The number of loci within a single cluster varies in different instances. In some instances, each cluster includes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 130, 150, 200, 300, 400, 500, 1000 or more loci. In some instances, each cluster includes about 50-500 loci. In some instances, each cluster includes about 100-200 loci. In some instances, each cluster includes about 100-150 loci. In some instances, each cluster includes about 109, 121, 130 or 137 loci. In some instances, each cluster includes about 19, 20, 61, 64 or more loci.

The number of distinct polynucleotides synthesized on a device may be dependent on the number of distinct loci available in the substrate. In some instances, the density of loci within a cluster of a device is at least or about 1 locus per mm², 10 loci per mm², 25 loci per mm², 50 loci per mm², 65 loci per mm², 75 loci per mm², 100 loci per mm², 130 loci per mm², 150 loci per mm², 175 loci per mm², 200 loci per mm², 300 loci per mm², 400 loci per mm², 500 loci per mm², 1,000 loci per mm² or more. In some instances, a device comprises from about 10 loci per mm² to about 500 loci per mm², from about 25 loci per mm² to about 400 loci per mm², from about 50 loci per mm² to about 500 loci per mm², from about 100 loci per mm² to about 500 loci per mm², from about 150 loci per mm² to about 500 loci per mm², from about 10 loci per mm² to about 250 loci per mm², from about 50 loci per mm² to about 250 loci per mm², from about 10 loci per mm² to about 200 loci per mm², or from about 50 loci per mm² to about 200 loci per mm². In some instances, the distance between the centers of two adjacent loci within a cluster is from about 10 μm to about 500 μm, from about 10 μm to about 200 μm, or from about 10 μm to about 100 μm. In some instances, the distance between the centers of two adjacent loci is greater than about 10 μm, 20 μm, 30 μm, 40 μm, 50 μm, 60 μm, 70 μm, 80 μm, 90 μm or 100 μm. In some instances, the distance between the centers of two adjacent loci is less than about 200 μm, 150 μm, 100 μm, 80 μm, 70 μm, 60 μm, 50 μm, 40 μm, 30 μm, 20 μm or 10 μm. In some instances, each locus has a width of about 0.5 μm, 1 μm, 2 μm, 3 μm, 4 μm, 5 μm, 6 μm, 7 μm, 8 μm, 9 μm, 10 μm, 20 μm, 30 μm, 40 μm, 50 μm, 60 μm, 70 μm, 80 μm, 90 μm or 100 μm. In some instances, each locus has a width of about 0.5 μm to 100 μm, about 0.5 μm to 50 μm, or about 10 μm to 75 μm.
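The locus densities and the center-to-center distances recited above are related by the packing geometry of the loci. The following is a minimal sketch of that conversion for two idealized layouts. The square and hexagonal grids, and the function names, are assumptions for illustration; the specification does not state which arrangement is used.

```python
import math


def square_grid_density(pitch_um: float) -> float:
    """Loci per mm^2 for a square grid with the given center-to-center pitch."""
    pitch_mm = pitch_um / 1000.0
    return 1.0 / pitch_mm ** 2


def hex_grid_density(pitch_um: float) -> float:
    """Loci per mm^2 for a hexagonal (close-packed) grid with the given pitch."""
    pitch_mm = pitch_um / 1000.0
    return 2.0 / (math.sqrt(3.0) * pitch_mm ** 2)


# A 100 um pitch falls within the recited 10 um to 500 um range.
square_density = square_grid_density(100.0)
hex_density = hex_grid_density(100.0)
```

Under these assumptions a 100 μm pitch corresponds to roughly 100 loci per mm² on a square grid and somewhat more on a hexagonal grid, both within the recited density ranges.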

In some instances, the density of clusters within a device is at least or about 1 cluster per 100 mm², 1 cluster per 10 mm², 1 cluster per 5 mm², 1 cluster per 4 mm², 1 cluster per 3 mm², 1 cluster per 2 mm², 1 cluster per 1 mm², 2 clusters per 1 mm², 3 clusters per 1 mm², 4 clusters per 1 mm², 5 clusters per 1 mm², 10 clusters per 1 mm², 50 clusters per 1 mm² or more. In some instances, a device comprises from about 1 cluster per 10 mm² to about 10 clusters per 1 mm². In some instances, the distance between the centers of two adjacent clusters is less than about 50 μm, 100 μm, 200 μm, 500 μm, 1000 μm, 2000 μm, or 5000 μm. In some instances, the distance between the centers of two adjacent clusters is from about 50 μm to about 100 μm, from about 50 μm to about 200 μm, from about 50 μm to about 300 μm, from about 50 μm to about 500 μm, or from about 100 μm to about 2000 μm. In some instances, the distance between the centers of two adjacent clusters is from about 0.05 mm to about 50 mm, from about 0.05 mm to about 10 mm, from about 0.05 mm to about 5 mm, from about 0.05 mm to about 4 mm, from about 0.05 mm to about 3 mm, from about 0.05 mm to about 2 mm, from about 0.1 mm to about 10 mm, from about 0.2 mm to about 10 mm, from about 0.3 mm to about 10 mm, from about 0.4 mm to about 10 mm, from about 0.5 mm to about 10 mm, from about 0.5 mm to about 5 mm, or from about 0.5 mm to about 2 mm. In some instances, each cluster has a diameter or width along one dimension of about 0.5 to 2 mm, about 0.5 to 1 mm, or about 1 to 2 mm. In some instances, each cluster has a diameter or width along one dimension of about 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9 or 2 mm. In some instances, each cluster has an interior diameter or width along one dimension of about 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.15, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9 or 2 mm.

A device may be about the size of a standard 96 well plate, for example from about 100 to 200 mm by from about 50 to 150 mm. In some instances, a device has a diameter less than or equal to about 1000 mm, 500 mm, 450 mm, 400 mm, 300 mm, 250 mm, 200 mm, 150 mm, 100 mm or 50 mm. In some instances, the diameter of a device is from about 25 mm to about 1000 mm, from about 25 mm to about 800 mm, from about 25 mm to about 600 mm, from about 25 mm to about 500 mm, from about 25 mm to about 400 mm, from about 25 mm to about 300 mm, or from about 25 mm to about 200 mm. Non-limiting examples of device size include about 300 mm, 200 mm, 150 mm, 130 mm, 100 mm, 76 mm, 51 mm and 25 mm. In some instances, a device has a planar surface area of at least about 100 mm²; 200 mm²; 500 mm²; 1,000 mm²; 2,000 mm²; 5,000 mm²; 10,000 mm²; 12,000 mm²; 15,000 mm²; 20,000 mm²; 30,000 mm²; 40,000 mm²; 50,000 mm² or more. In some instances, the thickness of a device is from about 50 μm to about 2000 μm, from about 50 μm to about 1000 μm, from about 100 μm to about 1000 μm, from about 200 μm to about 1000 μm, or from about 250 μm to about 1000 μm. Non-limiting examples of device thickness include 275 μm, 375 μm, 525 μm, 625 μm, 675 μm, 725 μm, 775 μm and 925 μm. In some instances, the thickness of a device varies with diameter and depends on the composition of the substrate. For example, a device comprising materials other than silicon has a different thickness than a silicon device of the same diameter. Device thickness may be determined by the mechanical strength of the material used, and the device must be thick enough to support its own weight without cracking during handling. In some instances, a structure comprises a plurality of devices described herein.
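The device dimensions, cluster densities, and loci-per-cluster figures recited in this section can be combined into a rough capacity estimate. The following is a minimal sketch of that arithmetic. The 127.76 mm by 85.48 mm footprint is an assumed microplate outline that falls within the recited size range, the function names are hypothetical, and the estimate ignores edge margins and any unusable area.

```python
def cluster_capacity(width_mm: float, height_mm: float,
                     clusters_per_mm2: float) -> int:
    """Whole clusters that fit on a rectangular device at a given density."""
    return int(width_mm * height_mm * clusters_per_mm2)


def locus_capacity(n_clusters: int, loci_per_cluster: int) -> int:
    """Total loci when every cluster holds the same number of loci."""
    return n_clusters * loci_per_cluster


# Assumed plate-sized footprint, 1 cluster per 2 mm^2, 121 loci per cluster
# (the density and loci count are among the values recited in this section).
clusters = cluster_capacity(127.76, 85.48, clusters_per_mm2=0.5)
loci = locus_capacity(clusters, loci_per_cluster=121)
```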

Surface Materials

Provided herein is a device comprising a surface, wherein the surface is modified to support polynucleotide synthesis at predetermined locations and with a resulting low error rate, a low dropout rate, a high yield, and a high oligo representation. In some instances, surfaces of a device for polynucleotide synthesis provided herein are fabricated from a variety of materials capable of modification to support a de novo polynucleotide synthesis reaction. In some cases, the devices are sufficiently conductive, e.g., are able to form uniform electric fields across all or a portion of the device. A device described herein may comprise a flexible material. Exemplary flexible materials include, without limitation, modified nylon, unmodified nylon, nitrocellulose, and polypropylene. A device described herein may comprise a rigid material. Exemplary rigid materials include, without limitation, glass, fused silica, silicon, silicon dioxide, silicon nitride, plastics (for example, polytetrafluoroethylene, polypropylene, polystyrene, polycarbonate, and blends thereof), and metals (for example, gold, platinum). Devices disclosed herein may be fabricated from a material comprising silicon, polystyrene, agarose, dextran, cellulosic polymers, polyacrylamides, polydimethylsiloxane (PDMS), glass, or any combination thereof. In some cases, a device disclosed herein is manufactured with a combination of materials listed herein or any other suitable material known in the art.

A listing of tensile strengths for exemplary materials described herein is provided as follows: nylon (70 MPa), nitrocellulose (1.5 MPa), polypropylene (40 MPa), silicon (268 MPa), polystyrene (40 MPa), agarose (1-10 MPa), polyacrylamide (1-10 MPa), polydimethylsiloxane (PDMS) (3.9-10.8 MPa). Solid supports described herein can have a tensile strength from 1 to 300, 1 to 40, 1 to 10, 1 to 5, or 3 to 11 MPa. Solid supports described herein can have a tensile strength of about 1, 1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20, 25, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 270, or more MPa. In some instances, a device described herein comprises a solid support for polynucleotide synthesis that is in the form of a flexible material capable of being stored in a continuous loop or reel, such as a tape or flexible sheet.

Young's modulus measures the resistance of a material to elastic (recoverable) deformation under load. A listing of Young's modulus for stiffness of exemplary materials described herein is provided as follows: nylon (3 GPa), nitrocellulose (1.5 GPa), polypropylene (2 GPa), silicon (150 GPa), polystyrene (3 GPa), agarose (1-10 GPa), polyacrylamide (1-10 GPa), polydimethylsiloxane (PDMS) (1-10 GPa). Solid supports described herein can have a Young's modulus from 1 to 500, 1 to 40, 1 to 10, 1 to 5, or 3 to 11 GPa. Solid supports described herein can have a Young's modulus of about 1, 1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20, 25, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 400, 500 GPa, or more. Because flexibility and stiffness are inversely related, a flexible material has a low Young's modulus and changes its shape considerably under load.
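
By way of non-limiting illustration, the inverse relationship between stiffness and deformation may be sketched numerically. The sketch below assumes simple linear elastic behavior (strain equals stress divided by Young's modulus), uses modulus values from the listing above, and applies a hypothetical load of 10 MPa chosen only for the example; the function and variable names are illustrative and are not part of the disclosure.

```python
# Illustrative sketch only: linear elastic strain = stress / Young's modulus.
GPA = 1e9  # pascals per gigapascal
MPA = 1e6  # pascals per megapascal

# Young's modulus values taken from the listing in the text above.
YOUNGS_MODULUS_PA = {
    "polypropylene": 2 * GPA,
    "nylon": 3 * GPA,
    "silicon": 150 * GPA,
}


def elastic_strain(stress_pa: float, modulus_pa: float) -> float:
    """Return the dimensionless elastic strain for a given stress and modulus."""
    if modulus_pa <= 0:
        raise ValueError("Young's modulus must be positive")
    return stress_pa / modulus_pa


# Hypothetical applied stress, assumed for illustration only.
stress = 10 * MPA
strains = {name: elastic_strain(stress, e) for name, e in YOUNGS_MODULUS_PA.items()}

for name, strain in strains.items():
    print(f"{name}: strain = {strain:.2e}")
```

Under these assumptions, the material with the lower modulus (polypropylene) deforms about 75 times more than silicon under the same load, consistent with a flexible material having a low Young's modulus.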

In some cases, a device disclosed herein comprises a silicon dioxide base and a surface layer of silicon oxide. Alternatively, the device may have a base of silicon oxide. The surface of a device provided herein may be textured, resulting in an increased overall surface area for polynucleotide synthesis. A device disclosed herein may comprise at least 5%, 10%, 25%, 50%, 80%, 90%, 95%, or 99% silicon. A device disclosed herein may be fabricated from a silicon on insulator (SOI) wafer.

Surface Architecture

Provided herein are devices comprising raised and/or lowered features. One benefit of having such features is an increase in surface area to support polynucleotide synthesis. In some instances, a device having raised and/or lowered features is referred to as a three-dimensional substrate. In some instances, a three-dimensional device comprises one or more channels. In some instances, one or more loci comprise a channel. In some instances, the channels are accessible to reagent deposition via a deposition device such as a polynucleotide synthesizer. In some instances, reagents and/or fluids collect in a larger well in fluid communication with one or more channels. For example, a device comprises a plurality of channels corresponding to a plurality of loci within a cluster, and the plurality of channels are in fluid communication with one well of the cluster. In some methods, a library of polynucleotides is synthesized in a plurality of loci of a cluster.

In some instances, the structure is configured to allow for controlled flow and mass transfer paths for polynucleotide synthesis on a surface. In some instances, the configuration of a device allows for the controlled and even distribution of mass transfer paths, chemical exposure times, and/or wash efficacy during polynucleotide synthesis. In some instances, the configuration of a device allows for increased sweep efficiency, for example by providing sufficient volume for a growing polynucleotide such that the excluded volume by the growing polynucleotide does not take up more than 50, 45, 40, 35, 30, 25, 20, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1%, or less of the initially available volume that is available or suitable for growing the polynucleotide. In some instances, a three-dimensional structure allows for managed flow of fluid to allow for the rapid exchange of chemical exposure.

Provided herein are methods to synthesize an amount of DNA of 1 fM, 5 fM, 10 fM, 25 fM, 50 fM, 75 fM, 100 fM, 200 fM, 300 fM, 400 fM, 500 fM, 600 fM, 700 fM, 800 fM, 900 fM, 1 pM, 5 pM, 10 pM, 25 pM, 50 pM, 75 pM, 100 pM, 200 pM, 300 pM, 400 pM, 500 pM, 600 pM, 700 pM, 800 pM, 900 pM, or more. In some instances, a polynucleotide library may span the length of about 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100% of a gene. A gene may be varied up to about 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 95%, or 100%.

Non-identical polynucleotides may collectively encode a sequence for at least 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 95%, or 100% of a gene. In some instances, a polynucleotide may encode a sequence of 50%, 60%, 70%, 80%, 85%, 90%, 95%, or more of a gene. In some instances, a polynucleotide may encode a sequence of 80%, 85%, 90%, 95%, or more of a gene.

In some instances, segregation is achieved by physical structure. In some instances, segregation is achieved by differential functionalization of the surface generating active and passive regions for polynucleotide synthesis. Differential functionalization may also be achieved by alternating the hydrophobicity across the device surface, thereby creating water contact angle effects that cause beading or wetting of the deposited reagents. Employing larger structures can decrease splashing and cross-contamination of distinct polynucleotide synthesis locations with reagents of the neighboring spots. In some instances, a device, such as a polynucleotide synthesizer, is used to deposit reagents to distinct polynucleotide synthesis locations. Substrates having three-dimensional features are configured in a manner that allows for the synthesis of a large number of polynucleotides (e.g., more than about 10,000) with a low error rate (e.g., less than about 1:500; 1:1,000; 1:1,500; 1:2,000; 1:3,000; 1:5,000; or 1:10,000). In some instances, a device comprises features with a density of about or greater than about 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 300, 400 or 500 features per mm².

A well of a device may have the same or different width, height, and/or volume as another well of the substrate. A channel of a device may have the same or different width, height, and/or volume as another channel of the substrate. In some instances, the width of a cluster is from about 0.05 mm to about 50 mm, from about 0.05 mm to about 10 mm, from about 0.05 mm to about 5 mm, from about 0.05 mm to about 4 mm, from about 0.05 mm to about 3 mm, from about 0.05 mm to about 2 mm, from about 0.05 mm to about 1 mm, from about 0.05 mm to about 0.5 mm, from about 0.05 mm to about 0.1 mm, from about 0.1 mm to about 10 mm, from about 0.2 mm to about 10 mm, from about 0.3 mm to about 10 mm, from about 0.4 mm to about 10 mm, from about 0.5 mm to about 10 mm, from about 0.5 mm to about 5 mm, or from about 0.5 mm to about 2 mm. In some instances, the width of a well comprising a cluster is from about 0.05 mm to about 50 mm, from about 0.05 mm to about 10 mm, from about 0.05 mm to about 5 mm, from about 0.05 mm to about 4 mm, from about 0.05 mm to about 3 mm, from about 0.05 mm to about 2 mm, from about 0.05 mm to about 1 mm, from about 0.05 mm to about 0.5 mm, from about 0.05 mm to about 0.1 mm, from about 0.1 mm to about 10 mm, from about 0.2 mm to about 10 mm, from about 0.3 mm to about 10 mm, from about 0.4 mm to about 10 mm, from about 0.5 mm to about 10 mm, from about 0.5 mm to about 5 mm, or from about 0.5 mm to about 2 mm. In some instances, the width of a cluster is less than or about 5 mm, 4 mm, 3 mm, 2 mm, 1 mm, 0.5 mm, 0.1 mm, 0.09 mm, 0.08 mm, 0.07 mm, 0.06 mm or 0.05 mm. In some instances, the width of a cluster is from about 1.0 mm to about 1.3 mm. In some instances, the width of a cluster is about 1.150 mm. In some instances, the width of a well is less than or about 5 mm, 4 mm, 3 mm, 2 mm, 1 mm, 0.5 mm, 0.1 mm, 0.09 mm, 0.08 mm, 0.07 mm, 0.06 mm or 0.05 mm. In some instances, the width of a well is from about 1.0 mm to about 1.3 mm. In some instances, the width of a well is about 1.150 mm. In some instances, the width of a cluster is about 0.08 mm. In some instances, the width of a well is about 0.08 mm. The width of a cluster may refer to clusters within a two-dimensional or three-dimensional substrate.

In some instances, the height of a well is from about 20 μm to about 1000 μm, from about 50 μm to about 1000 μm, from about 100 μm to about 1000 μm, from about 200 μm to about 1000 μm, from about 300 μm to about 1000 μm, from about 400 μm to about 1000 μm, or from about 500 μm to about 1000 μm. In some instances, the height of a well is less than about 1000 μm, less than about 900 μm, less than about 800 μm, less than about 700 μm, or less than about 600 μm.

In some instances, a device comprises a plurality of channels corresponding to a plurality of loci within a cluster, wherein the height or depth of a channel is from about 5 μm to about 500 μm, from about 5 μm to about 400 μm, from about 5 μm to about 300 μm, from about 5 μm to about 200 μm, from about 5 μm to about 100 μm, from about 5 μm to about 50 μm, or from about 10 μm to about 50 μm. In some instances, the height of a channel is less than 100 μm, less than 80 μm, less than 60 μm, less than 40 μm or less than 20 μm.

In some instances, the diameter of a channel, locus (e.g., in a substantially planar substrate) or both channel and locus (e.g., in a three-dimensional device wherein a locus corresponds to a channel) is from about 1 μm to about 1000 μm, from about 1 μm to about 500 μm, from about 1 μm to about 200 μm, from about 1 μm to about 100 μm, from about 5 μm to about 100 μm, or from about 10 μm to about 100 μm, for example, about 90 μm, 80 μm, 70 μm, 60 μm, 50 μm, 40 μm, 30 μm, 20 μm or 10 μm. In some instances, the diameter of a channel, locus, or both channel and locus is less than about 100 μm, 90 μm, 80 μm, 70 μm, 60 μm, 50 μm, 40 μm, 30 μm, 20 μm or 10 μm. In some instances, the distance from the center of two adjacent channels, loci, or channels and loci is from about 1 μm to about 500 μm, from about 1 μm to about 200 μm, from about 1 μm to about 100 μm, from about 5 μm to about 200 μm, from about 5 μm to about 100 μm, from about 5 μm to about 50 μm, or from about 5 μm to about 30 μm, for example, about 20 μm.

Surface Modifications

In various instances, surface modifications are employed for the chemical and/or physical alteration of a surface by an additive or subtractive process to change one or more chemical and/or physical properties of a device surface or a selected site or region of a device surface. For example, surface modifications include, without limitation, (1) changing the wetting properties of a surface, (2) functionalizing a surface, i.e., providing, modifying or substituting surface functional groups, (3) defunctionalizing a surface, i.e., removing surface functional groups, (4) otherwise altering the chemical composition of a surface, e.g., through etching, (5) increasing or decreasing surface roughness, (6) providing a coating on a surface, e.g., a coating that exhibits wetting properties that are different from the wetting properties of the surface, and/or (7) depositing particulates on a surface.

In some instances, the addition of a chemical layer on top of a surface (referred to as adhesion promoter) facilitates structured patterning of loci on a surface of a substrate. Exemplary surfaces for application of adhesion promotion include, without limitation, glass, silicon, silicon dioxide and silicon nitride. In some instances, the adhesion promoter is a chemical with a high surface energy. In some instances, a second chemical layer is deposited on a surface of a substrate. In some instances, the second chemical layer has a low surface energy. In some instances, surface energy of a chemical layer coated on a surface supports localization of droplets on the surface. Depending on the patterning arrangement selected, the proximity of loci and/or area of fluid contact at the loci are alterable.

In some instances, a device surface, or resolved loci, onto which nucleic acids or other moieties are deposited, e.g., for polynucleotide synthesis, are smooth or substantially planar (e.g., two-dimensional) or have irregularities, such as raised or lowered features (e.g., three-dimensional features). In some instances, a device surface is modified with one or more different layers of compounds. Such modification layers of interest include, without limitation, inorganic and organic layers such as metals, metal oxides, polymers, small organic molecules and the like. Non-limiting polymeric layers include peptides, proteins, nucleic acids or mimetics thereof (e.g., peptide nucleic acids and the like), polysaccharides, phospholipids, polyurethanes, polyesters, polycarbonates, polyureas, polyamides, polyethyleneamines, polyarylene sulfides, polysiloxanes, polyimides, polyacetates, and any other suitable compounds described herein or otherwise known in the art. In some instances, polymers are heteropolymeric. In some instances, polymers are homopolymeric. In some instances, polymers comprise functional moieties or are conjugated.

In some instances, resolved loci of a device are functionalized with one or more moieties that increase and/or decrease surface energy. In some instances, a moiety is chemically inert. In some instances, a moiety is configured to support a desired chemical reaction, for example, one or more processes in a polynucleotide synthesis reaction. The surface energy, or hydrophobicity, of a surface is a factor for determining the affinity of a nucleotide to attach onto the surface. In some instances, a method for device functionalization may comprise: (a) providing a device having a surface that comprises silicon dioxide; and (b) silanizing the surface using a suitable silanizing agent described herein or otherwise known in the art, for example, an organofunctional alkoxysilane molecule.

In some instances, the organofunctional alkoxysilane molecule comprises dimethylchloro-octadecyl-silane, methyldichloro-octadecyl-silane, trichloro-octadecyl-silane, trimethyl-octadecyl-silane, triethyl-octadecyl-silane, or any combination thereof. In some instances, a device surface is functionalized with polyethylene/polypropylene (functionalized by gamma irradiation or chromic acid oxidation, and reduction to hydroxyalkyl surface), highly crosslinked polystyrene-divinylbenzene (derivatized by chloromethylation, and aminated to benzylamine functional surface), nylon (the terminal aminohexyl groups are directly reactive), or etched with reduced polytetrafluoroethylene. Other methods and functionalizing agents are described in U.S. Pat. No. 5,474,796, which is herein incorporated by reference in its entirety.

In some instances, a device surface is functionalized by contact with a derivatizing composition that contains a mixture of silanes, under reaction conditions effective to couple the silanes to the device surface, typically via reactive hydrophilic moieties present on the device surface. Silanization generally covers a surface through self-assembly with organofunctional alkoxysilane molecules.

A variety of siloxane functionalizing reagents can further be used as currently known in the art, e.g., for lowering or increasing surface energy. The organofunctional alkoxysilanes can be classified according to their organic functions.

Provided herein are devices that may contain patterning of agents capable of coupling to a nucleoside. In some instances, a device may be coated with an active agent. In some instances, a device may be coated with a passive agent. Exemplary active agents for inclusion in coating materials described herein include, without limitation, N-(3-triethoxysilylpropyl)-4-hydroxybutyramide (HAPS), 11-acetoxyundecyltriethoxysilane, n-decyltriethoxysilane, (3-aminopropyl)trimethoxysilane, (3-aminopropyl)triethoxysilane, 3-glycidoxypropyltrimethoxysilane (GOPS), 3-iodo-propyltrimethoxysilane, butyl-aldehyde-trimethoxysilane, dimeric secondary aminoalkyl siloxanes, (3-aminopropyl)-diethoxy-methylsilane, (3-aminopropyl)-dimethyl-ethoxysilane, (3-aminopropyl)-trimethoxysilane, (3-glycidoxypropyl)-dimethyl-ethoxysilane, glycidoxy-trimethoxysilane, (3-mercaptopropyl)-trimethoxysilane, 3-4 epoxycyclohexyl-ethyltrimethoxysilane, (3-mercaptopropyl)-methyl-dimethoxysilane, allyl trichlorochlorosilane, 7-oct-1-enyl trichlorochlorosilane, or bis(3-trimethoxysilylpropyl)amine.

Exemplary passive agents for inclusion in a coating material described herein include, without limitation, perfluorooctyltrichlorosilane; (tridecafluoro-1,1,2,2-tetrahydrooctyl)trichlorosilane; 1H, 1H, 2H, 2H-fluorooctyltriethoxysilane (FOS); trichloro(1H, 1H, 2H, 2H-perfluorooctyl)silane; tert-butyl-[5-fluoro-4-(4,4,5,5-tetramethyl-1,3,2-dioxaborolan-2-yl)indol-1-yl]-dimethyl-silane; CYTOP™; Fluorinert™; perfluorooctyltrichlorosilane (PFOTCS); perfluorooctyldimethylchlorosilane (PFODCS); perfluorodecyltriethoxysilane (PFDTES); pentafluorophenyl-dimethylpropylchloro-silane (PFPTES); perfluorooctyltriethoxysilane; perfluorooctyltrimethoxysilane; octylchlorosilane; dimethylchloro-octadecyl-silane; methyldichloro-octadecyl-silane; trichloro-octadecyl-silane; trimethyl-octadecyl-silane; triethyl-octadecyl-silane; or octadecyltrichlorosilane.

In some instances, a functionalization agent comprises a hydrocarbon silane such as octadecyltrichlorosilane. In some instances, the functionalizing agent comprises 11-acetoxyundecyltriethoxysilane, n-decyltriethoxysilane, (3-aminopropyl)trimethoxysilane, (3-aminopropyl)triethoxysilane, glycidyloxypropyl/trimethoxysilane and N-(3-triethoxysilylpropyl)-4-hydroxybutyramide.

Polynucleotide Synthesis

Methods of the current disclosure for polynucleotide synthesis may include processes involving phosphoramidite chemistry. In some instances, polynucleotide synthesis comprises coupling a base with phosphoramidite. Polynucleotide synthesis may comprise coupling a base by deposition of phosphoramidite under coupling conditions, wherein the same base is optionally deposited with phosphoramidite more than once, i.e., double coupling. Polynucleotide synthesis may comprise capping of unreacted sites. In some instances, capping is optional. Polynucleotide synthesis may also comprise oxidation or an oxidation step or oxidation steps. Polynucleotide synthesis may comprise deblocking, detritylation, and sulfurization. In some instances, polynucleotide synthesis comprises either oxidation or sulfurization. In some instances, between one or more, or each, of the steps of a polynucleotide synthesis reaction, the device is washed, for example, using tetrazole or acetonitrile. Time frames for any one step in a phosphoramidite synthesis method may be less than about 2 minutes, 1 minute, 50 seconds, 40 seconds, 30 seconds, 20 seconds or 10 seconds.

Polynucleotide synthesis using a phosphoramidite method may comprise a subsequent addition of a phosphoramidite building block (e.g., nucleoside phosphoramidite) to a growing polynucleotide chain for the formation of a phosphite triester linkage. Phosphoramidite polynucleotide synthesis proceeds in the 3′ to 5′ direction. Phosphoramidite polynucleotide synthesis allows for the controlled addition of one nucleotide to a growing nucleic acid chain per synthesis cycle. In some instances, each synthesis cycle comprises a coupling step. Phosphoramidite coupling involves the formation of a phosphite triester linkage between an activated nucleoside phosphoramidite and a nucleoside bound to the substrate, for example, via a linker. In some instances, the nucleoside phosphoramidite is provided to the device activated. In some instances, the nucleoside phosphoramidite is provided to the device with an activator. In some instances, nucleoside phosphoramidites are provided to the device in a 1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100-fold excess or more over the substrate-bound nucleosides. In some instances, the addition of nucleoside phosphoramidite is performed in an anhydrous environment, for example, in anhydrous acetonitrile. Following addition of a nucleoside phosphoramidite, the device is optionally washed. In some instances, the coupling step is repeated one or more additional times, optionally with a wash step between nucleoside phosphoramidite additions to the substrate. In some instances, a polynucleotide synthesis method used herein comprises 1, 2, 3 or more sequential coupling steps. Prior to coupling, in many cases, the nucleoside bound to the device is de-protected by removal of a protecting group, where the protecting group functions to prevent polymerization. A common protecting group is 4,4′-dimethoxytrityl (DMT).

Following coupling, phosphoramidite polynucleotide synthesis methods optionally comprise a capping step. In a capping step, the growing polynucleotide is treated with a capping agent. A capping step is useful to block unreacted substrate-bound 5′-OH groups after coupling from further chain elongation, preventing the formation of polynucleotides with internal base deletions. Further, phosphoramidites activated with 1H-tetrazole may react, to a small extent, with the O6 position of guanosine. Without being bound by theory, upon oxidation with I₂/water, this side product, possibly via O6-N7 migration, may undergo depurination. The apurinic sites may end up being cleaved in the course of the final deprotection of the polynucleotide, thus reducing the yield of the full-length product. The O6 modifications may be removed by treatment with the capping reagent prior to oxidation with I₂/water. In some instances, inclusion of a capping step during polynucleotide synthesis decreases the error rate as compared to synthesis without capping. As an example, the capping step comprises treating the substrate-bound polynucleotide with a mixture of acetic anhydride and 1-methylimidazole. Following a capping step, the device is optionally washed.

In some instances, following addition of a nucleoside phosphoramidite, and optionally after capping and one or more wash steps, the device-bound growing nucleic acid is oxidized. In the oxidation step, the phosphite triester is oxidized into a tetracoordinated phosphate triester, a protected precursor of the naturally occurring phosphate diester internucleoside linkage. In some instances, oxidation of the growing polynucleotide is achieved by treatment with iodine and water, optionally in the presence of a weak base (e.g., pyridine, lutidine, collidine). Oxidation may be carried out under anhydrous conditions using, e.g., tert-butyl hydroperoxide or (1S)-(+)-(10-camphorsulfonyl)-oxaziridine (CSO). In some methods, a capping step is performed following oxidation. A second capping step allows for device drying, as residual water from oxidation that may persist can inhibit subsequent coupling. Following oxidation, the device and growing polynucleotide are optionally washed. In some instances, the step of oxidation is substituted with a sulfurization step to obtain polynucleotide phosphorothioates, wherein any capping steps can be performed after the sulfurization. Many reagents are capable of efficient sulfur transfer, including but not limited to 3-((dimethylaminomethylidene)amino)-3H-1,2,4-dithiazole-5-thione (DDTT), 3H-1,2-benzodithiol-3-one 1,1-dioxide (also known as Beaucage reagent), and N,N,N′,N′-tetraethylthiuram disulfide (TETD).

In order for a subsequent cycle of nucleoside incorporation to occur through coupling, the protecting group at the 5′ end of the device-bound growing polynucleotide is removed so that the primary hydroxyl group is reactive with a next nucleoside phosphoramidite. In some instances, the protecting group is DMT and deblocking occurs with trichloroacetic acid in dichloromethane. Conducting detritylation for an extended time or with stronger than recommended solutions of acids may lead to increased depurination of solid support-bound polynucleotide and thus reduce the yield of the desired full-length product. Methods and compositions of the disclosure described herein provide for controlled deblocking conditions limiting undesired depurination reactions. In some instances, the device-bound polynucleotide is washed after deblocking. In some instances, efficient washing after deblocking contributes to synthesized polynucleotides having a low error rate.

Methods for the synthesis of polynucleotides typically involve an iterating sequence of the following steps: application of a protected monomer to an actively functionalized surface (e.g., locus) to link with either the activated surface, a linker or with a previously deprotected monomer; deprotection of the applied monomer so that it is reactive with a subsequently applied protected monomer; and application of another protected monomer for linking. One or more intermediate steps include oxidation or sulfurization. In some instances, one or more wash steps precede or follow one or all of the steps.
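
By way of non-limiting illustration, the iterating sequence of steps described above may be sketched as a simple schedule generator. The sketch assumes one coupling per cycle, an optional capping step, oxidation (rather than sulfurization), deblocking at the end of each cycle, and an optional wash after every step; the function name, step labels, and flags are hypothetical and do not describe any particular instrument or protocol.

```python
from typing import List, Tuple


def synthesis_schedule(
    sequence_5to3: str, cap: bool = True, wash: bool = True
) -> List[Tuple[int, str]]:
    """Return (cycle number, step label) pairs for a target sequence.

    Hypothetical helper: the step labels are illustrative placeholders.
    """
    sequence = sequence_5to3.upper()
    if any(base not in "ACGT" for base in sequence):
        raise ValueError("sequence must contain only A, C, G, or T")

    steps: List[Tuple[int, str]] = []
    # Synthesis proceeds 3' to 5', so the 3'-most base is applied first.
    for cycle, base in enumerate(reversed(sequence), start=1):
        operations = [f"couple {base}"]
        if cap:
            operations.append("cap")  # optional capping of unreacted sites
        operations.append("oxidize")  # could be replaced by sulfurization
        operations.append("deblock")  # expose the 5' hydroxyl for the next cycle
        for operation in operations:
            steps.append((cycle, operation))
            if wash:
                steps.append((cycle, "wash"))
    return steps


schedule = synthesis_schedule("ACG", wash=False)
for cycle, step in schedule:
    print(cycle, step)
```

In this sketch, a three-base target yields three cycles, the first of which couples the 3′-terminal base; disabling the optional capping and wash steps shortens each cycle accordingly.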

Methods for phosphoramidite-based polynucleotide synthesis comprise a series of chemical steps. In some instances, one or more steps of a synthesis method involve reagent cycling, where one or more steps of the method comprise application to the device of a reagent useful for the step. For example, reagents are cycled by a series of liquid deposition and vacuum drying steps. For substrates comprising three-dimensional features such as wells, microwells, channels and the like, reagents are optionally passed through one or more regions of the device via the wells and/or channels.

Methods and systems described herein relate to polynucleotide synthesis devices for the synthesis of polynucleotides. The synthesis may be in parallel. For example, at least or about at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 1000, 10000, 50000, 75000, 100000 or more polynucleotides can be synthesized in parallel. The total number of polynucleotides that may be synthesized in parallel may be from 2-100000, 3-50000, 4-10000, 5-1000, 6-900, 7-850, 8-800, 9-750, 10-700, 11-650, 12-600, 13-550, 14-500, 15-450, 16-400, 17-350, 18-300, 19-250, 20-200, 21-150, 22-100, 23-50, 24-45, 25-40, 30-35. Those of skill in the art appreciate that the total number of polynucleotides synthesized in parallel may fall within any range bound by any of these values, for example 25-100. The total number of polynucleotides synthesized in parallel may fall within any range defined by any of the values serving as endpoints of the range. Total molar mass of polynucleotides synthesized within the device or the molar mass of each of the polynucleotides may be at least or at least about 10, 20, 30, 40, 50, 100, 250, 500, 750, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 25000, 50000, 75000, 100000 picomoles, or more. The length of each of the polynucleotides or average length of the polynucleotides within the device may be at least or about at least 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500 nucleotides, or more. The length of each of the polynucleotides or average length of the polynucleotides within the device may be at most or about at most 500, 400, 300, 200, 150, 100, 50, 45, 35, 30, 25, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10 nucleotides, or less. The length of each of the polynucleotides or average length of the polynucleotides within the device may fall from 10-500, 9-400, 11-300, 12-200, 13-150, 14-100, 15-50, 16-45, 17-40, 18-35, 19-25. Those of skill in the art appreciate that the length of each of the polynucleotides or average length of the polynucleotides within the device may fall within any range bound by any of these values, for example 100-300. The length of each of the polynucleotides or average length of the polynucleotides within the device may fall within any range defined by any of the values serving as endpoints of the range.

Methods for polynucleotide synthesis on a surface provided herein allow for synthesis at a fast rate. As an example, at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90, 100, 125, 150, 175, 200 nucleotides per hour, or more are synthesized. Nucleotides include adenine, guanine, thymine, cytosine, uridine building blocks, or analogs/modified versions thereof. In some instances, libraries of polynucleotides are synthesized in parallel on a substrate. For example, a device comprising about or at least about 100; 1,000; 10,000; 30,000; 75,000; 100,000; 1,000,000; 2,000,000; 3,000,000; 4,000,000; or 5,000,000 resolved loci is able to support the synthesis of at least the same number of distinct polynucleotides, wherein a polynucleotide encoding a distinct sequence is synthesized on a resolved locus. In some instances, a library of polynucleotides is synthesized on a device with low error rates described herein in less than about three months, two months, one month, three weeks, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2 days, 24 hours or less. In some instances, larger nucleic acids assembled from a polynucleotide library synthesized with low error rate using the substrates and methods described herein are prepared in less than about three months, two months, one month, three weeks, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2 days, 24 hours or less.

In some instances, methods described herein provide for generation of a library of polynucleotides comprising variant polynucleotides differing at a plurality of codon sites. In some instances, a polynucleotide may have 1 site, 2 sites, 3 sites, 4 sites, 5 sites, 6 sites, 7 sites, 8 sites, 9 sites, 10 sites, 11 sites, 12 sites, 13 sites, 14 sites, 15 sites, 16 sites, 17 sites, 18 sites, 19 sites, 20 sites, 30 sites, 40 sites, 50 sites, or more variant codon sites.

In some instances, the one or more variant codon sites may be adjacent. In some instances, the one or more variant codon sites may not be adjacent and may be separated by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codons.

In some instances, a polynucleotide may comprise multiple variant codon sites, wherein all the variant codon sites are adjacent to one another, forming a stretch of variant codon sites. In some instances, a polynucleotide may comprise multiple variant codon sites, wherein none of the variant codon sites are adjacent to one another. In some instances, a polynucleotide may comprise multiple variant codon sites, wherein some of the variant codon sites are adjacent to one another, forming a stretch of variant codon sites, and some of the variant codon sites are not adjacent to one another.
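
By way of non-limiting illustration, the three arrangements of variant codon sites described above (all adjacent, none adjacent, or a mixture) may be distinguished with a short routine. The sketch assumes that variant sites are given as integer codon indices along a polynucleotide; the function names and returned labels are hypothetical and are provided only for the example.

```python
from typing import Iterable, List


def intervening_codons(positions: Iterable[int]) -> List[int]:
    """Number of non-variant codons between each pair of neighboring variant sites."""
    sites = sorted(set(positions))
    return [later - earlier - 1 for earlier, later in zip(sites, sites[1:])]


def classify_variant_sites(positions: Iterable[int]) -> str:
    """Label a set of variant codon sites by how they are arranged."""
    gaps = intervening_codons(positions)
    if not gaps:
        return "fewer than two sites"
    if all(gap == 0 for gap in gaps):
        return "all adjacent"  # a single contiguous stretch of variant codons
    if all(gap > 0 for gap in gaps):
        return "none adjacent"  # every pair of neighbors is separated
    return "mixed"  # some sites form a stretch, others are separated


print(classify_variant_sites([10, 11, 12]))
print(classify_variant_sites([2, 5, 9]), intervening_codons([2, 5, 9]))
print(classify_variant_sites([1, 2, 7]))
```

Here, a gap of zero intervening codons corresponds to adjacent variant sites, and a gap of one or more corresponds to sites separated by that number of codons.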

Large Polynucleotide Libraries Having Low Error Rates

Average error rates for polynucleotides synthesized within a library using the systems and methods provided may be less than 1 in 1000, less than 1 in 1250, less than 1 in 1500, less than 1 in 2000, less than 1 in 3000 or less often. In some instances, average error rates for polynucleotides synthesized within a library using the systems and methods provided are less than 1/500, 1/600, 1/700, 1/800, 1/900, 1/1000, 1/1100, 1/1200, 1/1250, 1/1300, 1/1400, 1/1500, 1/1600, 1/1700, 1/1800, 1/1900, 1/2000, 1/3000, or less. In some instances, average error rates for polynucleotides synthesized within a library using the systems and methods provided are less than 1/1000.

In some instances, aggregate error rates for polynucleotides synthesized within a library using the systems and methods provided are less than 1/500, 1/600, 1/700, 1/800, 1/900, 1/1000, 1/1100, 1/1200, 1/1250, 1/1300, 1/1400, 1/1500, 1/1600, 1/1700, 1/1800, 1/1900, 1/2000, 1/3000, or less compared to the predetermined sequences. In some instances, aggregate error rates for polynucleotides synthesized within a library using the systems and methods provided are less than 1/500, 1/600, 1/700, 1/800, 1/900, or 1/1000. In some instances, aggregate error rates for polynucleotides synthesized within a library using the systems and methods provided are less than 1/1000.

In some instances, an error correction enzyme may be used for polynucleotides synthesized within a library using the systems and methods provided. In some instances, aggregate error rates for polynucleotides with error correction can be less than 1/500, 1/600, 1/700, 1/800, 1/900, 1/1000, 1/1100, 1/1200, 1/1300, 1/1400, 1/1500, 1/1600, 1/1700, 1/1800, 1/1900, 1/2000, 1/3000, or less compared to the predetermined sequences. In some instances, aggregate error rates with error correction for polynucleotides synthesized within a library using the systems and methods provided can be less than 1/500, 1/600, 1/700, 1/800, 1/900, or 1/1000. In some instances, aggregate error rates with error correction for polynucleotides synthesized within a library using the systems and methods provided can be less than 1/1000.

Error rate may limit the value of gene synthesis for the production of libraries of gene variants. With an error rate of 1/300, about 0.7% of the clones of a 1500 base pair gene will be correct. As most of the errors from polynucleotide synthesis result in frame-shift mutations, over 99% of the clones in such a library will not produce a full-length protein. Reducing the error rate by 75% would increase the fraction of clones that are correct by a factor of 40. The methods and compositions of the disclosure allow for fast de novo synthesis of large polynucleotide and gene libraries with error rates that are lower than those commonly observed with gene synthesis methods, both due to the improved quality of synthesis and the applicability of error correction methods that are enabled in a massively parallel and time-efficient manner. Accordingly, libraries may be synthesized with base insertion, deletion, substitution, or total error rates that are under 1/300, 1/400, 1/500, 1/600, 1/700, 1/800, 1/900, 1/1000, 1/1250, 1/1500, 1/2000, 1/2500, 1/3000, 1/4000, 1/5000, 1/6000, 1/7000, 1/8000, 1/9000, 1/10000, 1/12000, 1/15000, 1/20000, 1/25000, 1/30000, 1/40000, 1/50000, 1/60000, 1/70000, 1/80000, 1/90000, 1/100000, 1/125000, 1/150000, 1/200000, 1/300000, 1/400000, 1/500000, 1/600000, 1/700000, 1/800000, 1/900000, 1/1000000, or less, across the library, or across more than 80%, 85%, 90%, 93%, 95%, 96%, 97%, 98%, 99%, 99.5%, 99.8%, 99.9%, 99.95%, 99.98%, 99.99%, or more of the library. The methods and compositions of the disclosure further relate to large synthetic polynucleotide and gene libraries with low error rates, in which at least 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 93%, 95%, 96%, 97%, 98%, 99%, 99.5%, 99.8%, 99.9%, 99.95%, 99.98%, 99.99%, or more of the polynucleotides or genes in at least a subset of the library are error-free in comparison to a predetermined/preselected sequence.

In some instances, at least 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 93%, 95%, 96%, 97%, 98%, 99%, 99.5%, 99.8%, 99.9%, 99.95%, 99.98%, 99.99%, or more of the polynucleotides or genes in an isolated volume within the library have the same sequence. In some instances, at least 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 93%, 95%, 96%, 97%, 98%, 99%, 99.5%, 99.8%, 99.9%, 99.95%, 99.98%, 99.99%, or more of any polynucleotides or genes related with more than 95%, 96%, 97%, 98%, 99%, 99.5%, 99.6%, 99.7%, 99.8%, 99.9% or more similarity or identity have the same sequence. In some instances, the error rate related to a specified locus on a polynucleotide or gene is optimized. Thus, a given locus or a plurality of selected loci of one or more polynucleotides or genes as part of a large library may each have an error rate that is less than 1/300, 1/400, 1/500, 1/600, 1/700, 1/800, 1/900, 1/1000, 1/1250, 1/1500, 1/2000, 1/2500, 1/3000, 1/4000, 1/5000, 1/6000, 1/7000, 1/8000, 1/9000, 1/10000, 1/12000, 1/15000, 1/20000, 1/25000, 1/30000, 1/40000, 1/50000, 1/60000, 1/70000, 1/80000, 1/90000, 1/100000, 1/125000, 1/150000, 1/200000, 1/300000, 1/400000, 1/500000, 1/600000, 1/700000, 1/800000, 1/900000, 1/1000000, or less. In various instances, such error optimized loci may comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 30000, 50000, 75000, 100000, 500000, 1000000, 2000000, 3000000 or more loci. The error optimized loci may be distributed to at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 30000, 75000, 100000, 500000, 1000000, 2000000, 3000000 or more polynucleotides or genes.
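
By way of non-limiting illustration, the arithmetic behind the 1/300 example above can be sketched as follows. The sketch assumes that errors occur independently at each base, so that the fraction of error-free clones is (1 - error rate) raised to the gene length; this model and the function name are assumptions made for illustration and are not stated in the disclosure.

```python
def error_free_fraction(error_rate: float, length_bp: int) -> float:
    """Probability that a sequence of length_bp bases carries no error,
    assuming independent per-base errors (an illustrative assumption)."""
    return (1.0 - error_rate) ** length_bp


gene_length = 1500           # base pairs, from the text
baseline = 1 / 300           # error rate, from the text
improved = baseline * 0.25   # a 75% reduction, i.e. 1/1200

f_base = error_free_fraction(baseline, gene_length)
f_improved = error_free_fraction(improved, gene_length)
fold = f_improved / f_base

print(f"1/300 error rate:  {f_base:.2%} of clones error-free")
print(f"1/1200 error rate: {f_improved:.1%} of clones error-free")
print(f"fold improvement:  about {fold:.0f}x")
```

Under this simple model the baseline fraction comes out near 0.7% and the improvement is on the order of 40-fold, in line with the figures given above.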

The error rates can be achieved with or without error correction. The error rates can be achieved across the library, or across more than 80%, 85%, 90%, 93%, 95%, 96%, 97%, 98%, 99%, 99.5%, 99.8%, 99.9%, 99.95%, 99.98%, 99.99%, or more of the library.

Computer Systems

Any of the systems described herein may be operably linked to a computer and may be automated through a computer either locally or remotely. In various instances, the methods and systems of the disclosure may further comprise software programs on computer systems and use thereof. Accordingly, computerized control for the synchronization of the dispense/vacuum/refill functions, such as orchestrating and synchronizing the material deposition device movement, dispense action and vacuum actuation, is within the bounds of the disclosure. The computer systems may be programmed to interface between the user specified base sequence and the position of a material deposition device to deliver the correct reagents to specified regions of the substrate.

The computer system 1200 illustrated in FIG. 12 may be understood as a logical apparatus that can read instructions from media 1211 and/or a network port 1205, which can optionally be connected to server 1209 having fixed media 1212. The system, such as shown in FIG. 12, can include a CPU 1201, disk drives 1203, optional input devices such as keyboard 1215 and/or mouse 1216 and optional monitor 1207. Data communication can be achieved through the indicated communication medium to a server at a local or a remote location. The communication medium can include any means of transmitting and/or receiving data. For example, the communication medium can be a network connection, a wireless connection or an internet connection. Such a connection can provide for communication over the World Wide Web. It is envisioned that data relating to the present disclosure can be transmitted over such networks or connections for reception and/or review by a party 1222 as illustrated in FIG. 12.

FIG. 13 is a block diagram illustrating a first example architecture of a computer system 1300 that can be used in connection with example instances of the present disclosure. As depicted in FIG. 13, the example computer system can include a processor 1302 for processing instructions. Non-limiting examples of processors include: Intel Xeon™ processor, AMD Opteron™ processor, Samsung 32-bit RISC ARM 1176JZ(F)-S v1.0™ processor, ARM Cortex-A8 Samsung S5PC100™ processor, ARM Cortex-A8 Apple A4™ processor, Marvell PXA 930™ processor, or a functionally-equivalent processor. Multiple threads of execution can be used for parallel processing. In some instances, multiple processors or processors with multiple cores can also be used, whether in a single computer system, in a cluster, or distributed across systems over a network comprising a plurality of computers, cell phones, and/or personal data assistant devices.

As illustrated in FIG. 13, a high speed cache 1304 can be connected to, or incorporated in, the processor 1302 to provide a high speed memory for instructions or data that have been recently, or are frequently, used by processor 1302. The processor 1302 is connected to a north bridge 1306 by a processor bus 1308. The north bridge 1306 is connected to random access memory (RAM) 1310 by a memory bus 1312 and manages access to the RAM 1310 by the processor 1302. The north bridge 1306 is also connected to a south bridge 1314 by a chipset bus 1316. The south bridge 1314 is, in turn, connected to a peripheral bus 1318. The peripheral bus can be, for example, PCI, PCI-X, PCI Express, or other peripheral bus. The north bridge and south bridge are often referred to as a processor chipset and manage data transfer between the processor, RAM, and peripheral components on the peripheral bus 1318. In some alternative architectures, the functionality of the north bridge can be incorporated into the processor instead of using a separate north bridge chip. In some instances, system 1300 can include an accelerator card 1322 attached to the peripheral bus 1318. The accelerator can include field programmable gate arrays (FPGAs) or other hardware for accelerating certain processing. For example, an accelerator can be used for adaptive data restructuring or to evaluate algebraic expressions used in extended set processing.

Software and data are stored in external storage 1324 and can be loaded into RAM 1310 and/or cache 1304 for use by the processor. The system 1300 includes an operating system for managing system resources, as well as application software running on top of the operating system for managing data storage and optimization in accordance with example instances of the present disclosure. Non-limiting examples of operating systems include Linux, Windows™, MACOS™, BlackBerry OS™, iOS™, and other functionally-equivalent operating systems. In this example, system 1300 also includes network interface cards (NICs) 1320 and 1321 connected to the peripheral bus for providing network interfaces to external storage, such as Network Attached Storage (NAS) and other computer systems that can be used for distributed parallel processing.

FIG. 14 is a diagram showing a network 1400 with a plurality of computer systems 1402 a and 1402 b, a plurality of cell phones and personal data assistants 1402 c, and Network Attached Storage (NAS) 1404 a and 1404 b. In example instances, systems 1402 a, 1402 b, and 1402 c can manage data storage and optimize data access for data stored in Network Attached Storage (NAS) 1404 a and 1404 b. A mathematical model can be used for the data and be evaluated using distributed parallel processing across computer systems 1402 a and 1402 b, and cell phone and personal data assistant systems 1402 c. Computer systems 1402 a and 1402 b, and cell phone and personal data assistant systems 1402 c can also provide parallel processing for adaptive data restructuring of the data stored in Network Attached Storage (NAS) 1404 a and 1404 b. FIG. 14 illustrates an example only, and a wide variety of other computer architectures and systems can be used in conjunction with the various instances of the present disclosure. For example, a blade server can be used to provide parallel processing. Processor blades can be connected through a back plane to provide parallel processing. Storage can also be connected to the back plane or as Network Attached Storage (NAS) through a separate network interface. In some example instances, processors can maintain separate memory spaces and transmit data through network interfaces, back plane or other connectors for parallel processing by other processors. In other instances, some or all of the processors can use a shared virtual address memory space.

FIG. 15 is a block diagram of a multiprocessor computer system 1500 using a shared virtual address memory space in accordance with an example instance. The system includes a plurality of processors 1502 a-f that can access a shared memory subsystem 1504. The system incorporates a plurality of programmable hardware memory algorithm processors (MAPs) 1506 a-f in the memory subsystem 1504. Each MAP 1506 a-f can comprise a memory 1508 a-f and one or more field programmable gate arrays (FPGAs) 1510 a-f. The MAP provides a configurable functional unit, and particular algorithms or portions of algorithms can be provided to the FPGAs 1510 a-f for processing in close coordination with a respective processor. For example, the MAPs can be used to evaluate algebraic expressions regarding the data model and to perform adaptive data restructuring in example instances. In this example, each MAP is globally accessible by all of the processors for these purposes. In one configuration, each MAP can use Direct Memory Access (DMA) to access an associated memory 1508 a-f, allowing it to execute tasks independently of, and asynchronously from, the respective microprocessor 1502 a-f. In this configuration, a MAP can feed results directly to another MAP for pipelining and parallel execution of algorithms.

The above computer architectures and systems are examples only, and a wide variety of other computer, cell phone, and personal data assistant architectures and systems can be used in connection with example instances, including systems using any combination of general processors, co-processors, FPGAs and other programmable logic devices, system on chips (SOCs), application specific integrated circuits (ASICs), and other processing and logic elements. In some instances, all or part of the computer system can be implemented in software or hardware. Any variety of data storage media can be used in connection with example instances, including random access memory, hard drives, flash memory, tape drives, disk arrays, Network Attached Storage (NAS) and other local or distributed data storage devices and systems.

In example instances, the computer system can be implemented using software modules executing on any of the above or other computer architectures and systems. In other instances, the functions of the system can be implemented partially or completely in firmware, programmable logic devices such as field programmable gate arrays (FPGAs) as referenced in FIG. 15, system on chips (SOCs), application specific integrated circuits (ASICs), or other processing and logic elements. For example, the Set Processor and Optimizer can be implemented with hardware acceleration through the use of a hardware accelerator card, such as accelerator card 1322 illustrated in FIG. 13.

Numbered Embodiments

Provided herein are numbered embodiments 1-23.

Embodiment 1. A polynucleotide, wherein the polynucleotide comprises: a first strand, wherein the first strand comprises a first terminal adapter region, a first non-complementary region, a first yoke region, and a first unique molecular identifier; and a second strand, wherein the second strand comprises a second terminal adapter region, a second non-complementary region, a second yoke region, and a second unique molecular identifier; wherein the first yoke region and the second yoke region are complementary, wherein the first non-complementary region and the second non-complementary region are not complementary, and wherein the first yoke region or the second yoke region comprise at least one nucleobase analogue.

Embodiment 2. The polynucleotide of embodiment 1, wherein the nucleobase analogue increases the Tm of binding the first yoke region to the second yoke region.

Embodiment 3. The polynucleotide of embodiment 1 or 2, wherein the nucleobase analogue is a locked nucleic acid (LNA) or a bridged nucleic acid (BNA).

Embodiment 4. The polynucleotide of any one of embodiments 1-3, wherein the complementary first yoke region and second yoke region are each less than 15 bases in length.

Embodiment 5. The polynucleotide of any one of embodiments 1-3, wherein the complementary first yoke region and second yoke region are each less than 10 bases in length.

Embodiment 6. The polynucleotide of any one of embodiments 1-3, wherein the complementary first yoke region and second yoke region are each less than 6 bases in length.

Embodiment 7. The polynucleotide of any one of embodiments 1-6, wherein the first unique molecular identifier is 4-8 bases in length.

Embodiment 8. The polynucleotide of any one of embodiments 1-6, wherein the second unique molecular identifier is 4-8 bases in length.

Embodiment 9. The polynucleotide of any one of embodiments 1-8, wherein the first unique molecular identifier and the second unique molecular identifier are complementary.

Embodiment 10. The polynucleotide of any one of embodiments 1-8, wherein the first unique molecular identifier and the second unique molecular identifier are not complementary.

Embodiment 11. A polynucleotide of any one of embodiments 1-10, further comprising a sample nucleic acid.

Embodiment 12. The polynucleotide of embodiment 11, wherein the sample nucleic acid is DNA.

Embodiment 13. The polynucleotide of embodiment 11, wherein the sample nucleic acid is genomic DNA.

Embodiment 14. The polynucleotide of any one of embodiments 11-13, wherein the genomic DNA is of human origin.

Embodiment 15. The polynucleotide of any one of embodiments 1-14, wherein the first strand or the second strand further comprises at least one barcode.

Embodiment 16. The polynucleotide of embodiment 15, wherein the at least one barcode is at least 8 bases in length.

Embodiment 17. The polynucleotide of embodiment 15, wherein the at least one barcode is 8-12 bases in length.

Embodiment 18. The polynucleotide of any one of embodiments 15-17, wherein the at least one barcode identifies the origin of the sample nucleic acid.

Embodiment 19. A method of sequencing comprising: ligating one or more polynucleotides of any one of embodiments 1-11 to a plurality of sample nucleic acids to generate a library of adapter-ligated sample polynucleotides, wherein at least some of the adapter-ligated sample polynucleotides are uniquely identifiable; amplifying the library; enriching the library for one or more adapter-ligated sample polynucleotides in the library; sequencing the enriched library to generate a plurality of reads; and organizing the reads based on the first unique molecular identifier and the second unique molecular identifier to distinguish between amplification errors and single nucleotide polymorphisms present in the sample nucleic acids.

Embodiment 20. The method of embodiment 19, wherein the first unique molecular identifier and the second unique molecular identifier are selected from a set of no more than 64 sequences.

Embodiment 21. The method of embodiment 19, wherein the first unique molecular identifier and the second unique molecular identifier are selected from a set of no more than 48 sequences.

Embodiment 22. The method of embodiment 19, wherein the first unique molecular identifier and the second unique molecular identifier are selected from a set of sequences having a Hamming distance of at least 2.

Embodiment 23. The method of embodiment 19, wherein the single nucleotide polymorphisms are present at less than 1% abundance in the sample nucleic acids.
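
By way of non-limiting illustration, the step of organizing reads by the first and second unique molecular identifiers can be sketched as grouping reads into families and taking a per-base majority vote within each family. The read tuple layout, the use of a mapping position in the family key, the function names, the 0.6 agreement threshold, and the example reads are all hypothetical and chosen only for this sketch; the disclosure does not specify a particular consensus algorithm.

```python
from collections import Counter, defaultdict


def group_into_families(reads):
    """Group reads sharing the same (first UMI, second UMI, position) key."""
    families = defaultdict(list)
    for umi1, umi2, position, sequence in reads:
        families[(umi1, umi2, position)].append(sequence)
    return families


def consensus(sequences, min_fraction=0.6):
    """Per-base majority vote over equal-length reads.

    A position is reported as 'N' when the most common base is supported
    by less than min_fraction of the reads in the family.
    """
    called = []
    for column in zip(*sequences):
        base, count = Counter(column).most_common(1)[0]
        called.append(base if count / len(column) >= min_fraction else "N")
    return "".join(called)


# Hypothetical reads: (first UMI, second UMI, mapping position, sequence).
reads = [
    ("ACGT", "TGCA", 1000, "GATTACA"),
    ("ACGT", "TGCA", 1000, "GATTACA"),
    ("ACGT", "TGCA", 1000, "GATTGCA"),  # minority base: likely amplification error
    ("CCAT", "GGTA", 1000, "GATCACA"),
    ("CCAT", "GGTA", 1000, "GATCACA"),
    ("CCAT", "GGTA", 1000, "GATCACA"),  # every read agrees: candidate true variant
]

consensus_by_family = {
    key: consensus(seqs) for key, seqs in group_into_families(reads).items()
}
for key, seq in consensus_by_family.items():
    print(key, seq)
```

In this sketch, a base change seen in only a minority of reads within one family is voted out, whereas a change carried by every read of a family survives into the consensus and can be passed to a variant caller.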

EXAMPLES

The following examples are given for the purpose of illustrating various embodiments of the invention and are not meant to limit the present invention in any fashion. The present examples, along with the methods described herein, are presently representative of preferred embodiments, are exemplary, and are not intended as limitations on the scope of the invention. Changes therein and other uses which are encompassed within the spirit of the invention as defined by the scope of the claims will occur to those skilled in the art.

Example 1: Functionalization of a Substrate Surface

A substrate was functionalized to support the attachment and synthesis of a library of polynucleotides. The substrate surface was first wet cleaned using a piranha solution comprising 90% H₂SO₄ and 10% H₂O₂ for 20 minutes. The substrate was rinsed in several beakers with DI water, held under a DI water gooseneck faucet for 5 minutes, and dried with N₂. The substrate was subsequently soaked in NH₄OH (1:100; 3 mL:300 mL) for 5 minutes, rinsed with DI water using a handgun, soaked in three successive beakers with DI water for 1 minute each, and then rinsed again with DI water using the handgun. The substrate was then plasma cleaned by exposing the substrate surface to O₂. A SAMCO PC-300 instrument was used to plasma etch O₂ at 250 watts for 1 minute in downstream mode.

The cleaned substrate surface was actively functionalized with a solution comprising N-(3-triethoxysilylpropyl)-4-hydroxybutyramide using a YES-1224P vapor deposition oven system with the following parameters: 0.5 to 1 torr, 60 minutes, 70° C., 135° C. vaporizer. The substrate surface was resist coated using a Brewer Science 200× spin coater. SPR™ 3612 photoresist was spin coated on the substrate at 2500 rpm for 40 seconds. The substrate was pre-baked for 30 minutes at 90° C. on a Brewer hot plate. The substrate was subjected to photolithography using a Karl Suss MA6 mask aligner instrument. The substrate was exposed for 2.2 seconds and developed for 1 minute in MSF 26A. Remaining developer was rinsed with the handgun and the substrate soaked in water for 5 minutes. The substrate was baked for 30 minutes at 100° C. in the oven, followed by visual inspection for lithography defects using a Nikon L200. A descum process was used to remove residual resist using the SAMCO PC-300 instrument to O₂ plasma etch at 250 watts for 1 minute.

The substrate surface was passively functionalized with a 100 μL solution of perfluorooctyltrichlorosilane mixed with 10 μL light mineral oil. The substrate was placed in a chamber, pumped for 10 minutes, and then the valve was closed to the pump and left to stand for 10 minutes. The chamber was vented to air. The substrate was resist stripped by performing two soaks for 5 minutes in 500 mL NMP at 70° C. with ultrasonication at maximum power (9 on Crest system). The substrate was then soaked for 5 minutes in 500 mL isopropanol at room temperature with ultrasonication at maximum power. The substrate was dipped in 300 mL of 200 proof ethanol and blown dry with N₂. The functionalized surface was activated to serve as a support for polynucleotide synthesis.

Example 2: Synthesis of a 50-Mer Sequence on a Polynucleotide Synthesis Device

A two dimensional polynucleotide synthesis device was assembled into a flowcell, which was connected to a flowcell (Applied Biosystems “ABI 394 DNA Synthesizer”). The polynucleotide synthesis device was uniformly functionalized with N-(3-TRIETHOXYSILYLPROPYL)-4-HYDROXYBUTYRAMIDE (Gelest) and was used to synthesize an exemplary polynucleotide of 50 bp (“50-mer polynucleotide”) using polynucleotide synthesis methods described herein.

The sequence of the 50-mer was as described in SEQ ID NO.: 1. 5′AGACAATCAACCATTTGGGGTGGACAGCCTTGACCTCTAGACTTCGGCAT##TTTTTTTTTT3′ (SEQ ID NO.: 1), where # denotes Thymidine-succinyl hexamide CED phosphoramidite (CLP-2244 from ChemGenes), which is a cleavable linker enabling the release of polynucleotides from the surface during deprotection.

The synthesis was done using standard DNA synthesis chemistry (coupling, capping, oxidation, and deblocking) according to the protocol in Table 2 and an ABI synthesizer.

TABLE 2

| General DNA Synthesis Process Name | Process Step | Time (seconds) |
| WASH (Acetonitrile Wash Flow) | Acetonitrile System Flush | 4 |
| | Acetonitrile to Flowcell | 23 |
| | N2 System Flush | 4 |
| | Acetonitrile System Flush | 4 |
| DNA BASE ADDITION (Phosphoramidite + Activator Flow) | Activator Manifold Flush | 2 |
| | Activator to Flowcell | 6 |
| | Activator + Phosphoramidite to Flowcell | 6 |
| | Activator to Flowcell | 0.5 |
| | Activator + Phosphoramidite to Flowcell | 5 |
| | Activator to Flowcell | 0.5 |
| | Activator + Phosphoramidite to Flowcell | 5 |
| | Activator to Flowcell | 0.5 |
| | Activator + Phosphoramidite to Flowcell | 5 |
| | Incubate for 25 sec | 25 |
| WASH (Acetonitrile Wash Flow) | Acetonitrile System Flush | 4 |
| | Acetonitrile to Flowcell | 15 |
| | N2 System Flush | 4 |
| | Acetonitrile System Flush | 4 |
| DNA BASE ADDITION (Phosphoramidite + Activator Flow) | Activator Manifold Flush | 2 |
| | Activator to Flowcell | 5 |
| | Activator + Phosphoramidite to Flowcell | 18 |
| | Incubate for 25 sec | 25 |
| WASH (Acetonitrile Wash Flow) | Acetonitrile System Flush | 4 |
| | Acetonitrile to Flowcell | 15 |
| | N2 System Flush | 4 |
| | Acetonitrile System Flush | 4 |
| CAPPING (CapA + B, 1:1, Flow) | CapA + B to Flowcell | 15 |
| WASH (Acetonitrile Wash Flow) | Acetonitrile System Flush | 4 |
| | Acetonitrile to Flowcell | 15 |
| | Acetonitrile System Flush | 4 |
| OXIDATION (Oxidizer Flow) | Oxidizer to Flowcell | 18 |
| WASH (Acetonitrile Wash Flow) | Acetonitrile System Flush | 4 |
| | N2 System Flush | 4 |
| | Acetonitrile System Flush | 4 |
| | Acetonitrile to Flowcell | 15 |
| | Acetonitrile System Flush | 4 |
| | Acetonitrile to Flowcell | 15 |
| | N2 System Flush | 4 |
| | Acetonitrile System Flush | 4 |
| | Acetonitrile to Flowcell | 23 |
| | N2 System Flush | 4 |
| | Acetonitrile System Flush | 4 |
| DEBLOCKING (Deblock Flow) | Deblock to Flowcell | 36 |
| WASH (Acetonitrile Wash Flow) | Acetonitrile System Flush | 4 |
| | N2 System Flush | 4 |
| | Acetonitrile System Flush | 4 |
| | Acetonitrile to Flowcell | 18 |
| | N2 System Flush | 4.13 |
| | Acetonitrile System Flush | 4.13 |
| | Acetonitrile to Flowcell | 15 |

The phosphoramidite/activator combination was delivered similar to the delivery of bulk reagents through the flowcell. No drying steps were performed as the environment stays “wet” with reagent the entire time.

The flow restrictor was removed from the ABI 394 synthesizer to enable faster flow. Without the flow restrictor, flow rates for amidites (0.1M in ACN), Activator (0.25M Benzoylthiotetrazole (“BTT”; 30-3070-xx from GlenResearch) in ACN), and Ox (0.02M I₂ in 20% pyridine, 10% water, and 70% THF) were roughly 100 uL/second; for acetonitrile (“ACN”) and capping reagents (1:1 mix of CapA and CapB, wherein CapA is acetic anhydride in THF/Pyridine and CapB is 16% 1-methylimidazole in THF), roughly 200 uL/second; and for Deblock (3% dichloroacetic acid in toluene), roughly 300 uL/second (compared to ~50 uL/second for all reagents with the flow restrictor). The time to completely push out Oxidizer was observed, the timing for chemical flow times was adjusted accordingly, and an extra ACN wash was introduced between different chemicals. After polynucleotide synthesis, the chip was deprotected in gaseous ammonia overnight at 75 psi. Five drops of water were applied to the surface to recover polynucleotides. The recovered polynucleotides were then analyzed on a BioAnalyzer small RNA chip (data not shown).

Example 3: Synthesis of a 100-Mer Sequence on a Polynucleotide Synthesis Device

The same process as described in Example 2 for the synthesis of the 50-mer sequence was used for the synthesis of a 100-mer polynucleotide (“100-mer polynucleotide”; 5′CGGGATCCTTATCGTCATCGTCGTACAGATCCCGACCCATTTGCTGTCCACCAGTCATGCTAGCCATACCATGATGATGATGATGATGAGAACCCCGCAT##TTTTTTTTTT3′, where # denotes Thymidine-succinyl hexamide CED phosphoramidite (CLP-2244 from ChemGenes); SEQ ID NO.: 2) on two different silicon chips, the first one uniformly functionalized with N-(3-TRIETHOXYSILYLPROPYL)-4-HYDROXYBUTYRAMIDE and the second one functionalized with a 5/95 mix of 11-acetoxyundecyltriethoxysilane and n-decyltriethoxysilane. The polynucleotides extracted from the surface were analyzed on a BioAnalyzer instrument (data not shown).

All ten samples from the two chips were further PCR amplified using a forward (5′ATGCGGGGTTCTCATCATC3′; SEQ ID NO.: 3) and a reverse (5′CGGGATCCTTATCGTCATCG3′; SEQ ID NO.: 4) primer in a 50 uL PCR mix (25 uL NEB Q5 master mix, 2.5 uL 10 uM Forward primer, 2.5 uL 10 uM Reverse primer, 1 uL polynucleotide extracted from the surface, and water up to 50 uL) using the following thermal cycling program:

98 C, 30 seconds

98 C, 10 seconds; 63 C, 10 seconds; 72 C, 10 seconds; repeat 12 cycles

72 C, 2 minutes
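
By way of non-limiting illustration, the reaction mix and thermal cycling program above can be written out as data to check the water volume, the final primer concentration, and the programmed hold time. The variable names are chosen for this sketch only, and the time total deliberately ignores instrument ramp times, which are not given in the text.

```python
# Reaction components in uL, taken from the text; water makes up the rest.
FINAL_VOLUME_UL = 50.0
components_ul = {
    "NEB Q5 master mix": 25.0,
    "10 uM forward primer": 2.5,
    "10 uM reverse primer": 2.5,
    "polynucleotide extracted from the surface": 1.0,
}
water_ul = FINAL_VOLUME_UL - sum(components_ul.values())

# Final concentration of each primer after dilution into the 50 uL mix.
primer_final_um = components_ul["10 uM forward primer"] * 10.0 / FINAL_VOLUME_UL

# Thermal cycling program as (temperature in deg C, hold time in seconds).
initial_denaturation = (98, 30)
cycle_steps = [(98, 10), (63, 10), (72, 10)]
n_cycles = 12
final_extension = (72, 120)

hold_seconds = (
    initial_denaturation[1]
    + n_cycles * sum(seconds for _, seconds in cycle_steps)
    + final_extension[1]
)

print(f"water to add: {water_ul} uL")
print(f"final primer concentration: {primer_final_um} uM each")
print(f"programmed hold time: {hold_seconds} s (excluding ramp times)")
```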

The PCR products were also run on a BioAnalyzer (data not shown), demonstrating sharp peaks at the 100-mer position. Next, the PCR amplified samples were cloned and Sanger sequenced. Table 3 summarizes the results from the Sanger sequencing for samples taken from spots 1-5 from chip 1 and for samples taken from spots 6-10 from chip 2.

TABLE 3

| Spot | Error rate | Cycle efficiency |
| 1 | 1/763 bp | 99.87% |
| 2 | 1/824 bp | 99.88% |
| 3 | 1/780 bp | 99.87% |
| 4 | 1/429 bp | 99.77% |
| 5 | 1/1525 bp | 99.93% |
| 6 | 1/1615 bp | 99.94% |
| 7 | 1/531 bp | 99.81% |
| 8 | 1/1769 bp | 99.94% |
| 9 | 1/854 bp | 99.88% |
| 10 | 1/1451 bp | 99.93% |

Thus, the high quality and uniformity of the synthesized polynucleotides were reproduced on two chips with different surface chemistries. Overall, 89% of the 100-mers that were sequenced, corresponding to 233 out of 262, were perfect sequences with no errors.
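
By way of non-limiting illustration, the cycle efficiencies in Table 3 are numerically consistent with one minus the per-base error rate. That relationship is inferred here from the tabulated values rather than stated in the text, and the function name is chosen for this sketch only.

```python
# Bases per error for spots 1-10, taken from Table 3.
bases_per_error = {
    1: 763, 2: 824, 3: 780, 4: 429, 5: 1525,
    6: 1615, 7: 531, 8: 1769, 9: 854, 10: 1451,
}


def cycle_efficiency_percent(n_bases: int) -> float:
    """Cycle efficiency, assuming it equals 1 - (1 error per n_bases)."""
    return 100.0 * (1.0 - 1.0 / n_bases)


efficiencies = {
    spot: f"{cycle_efficiency_percent(n):.2f}%"
    for spot, n in bases_per_error.items()
}
for spot, value in efficiencies.items():
    print(f"spot {spot:>2}: 1/{bases_per_error[spot]} bp -> {value}")

# Fraction of sequenced 100-mers with no errors, from the text.
perfect_fraction = 233 / 262
print(f"perfect sequences: {perfect_fraction:.0%}")
```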

Finally, Table 4 summarizes error characteristics for the sequences obtained from the polynucleotide samples from spots 1-10.

TABLE 4

| Sample ID/Spot no. | OSA_0046/1 | OSA_0047/2 | OSA_0048/3 | OSA_0049/4 | OSA_0050/5 | OSA_0051/6 | OSA_0052/7 | OSA_0053/8 | OSA_0054/9 | OSA_055/10 |
| Total Sequences | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| Sequencing Quality | 25 of 28 | 27 of 27 | 26 of 30 | 21 of 23 | 25 of 26 | 29 of 30 | 27 of 31 | 29 of 31 | 28 of 29 | 25 of 28 |
| Oligo Quality | 23 of 25 | 25 of 27 | 22 of 26 | 18 of 21 | 24 of 25 | 25 of 29 | 22 of 27 | 28 of 29 | 26 of 28 | 20 of 25 |
| ROI Match Count | 2500 | 2698 | 2561 | 2122 | 2499 | 2666 | 2625 | 2899 | 2798 | 2348 |
| ROI Mutation | 2 | 2 | 1 | 3 | 1 | 0 | 2 | 1 | 2 | 1 |
| ROI Multi Base Deletion | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ROI Small Insertion | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ROI Single Base Deletion | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Large Deletion Count | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| Mutation: G > A | 2 | 2 | 1 | 2 | 1 | 0 | 2 | 1 | 2 | 1 |
| Mutation: T > C | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ROI Error Count | 3 | 2 | 2 | 3 | 1 | 1 | 3 | 1 | 2 | 1 |
| ROI Error Rate | Err: ~1 in 834 | Err: ~1 in 1350 | Err: ~1 in 1282 | Err: ~1 in 708 | Err: ~1 in 2500 | Err: ~1 in 2667 | Err: ~1 in 876 | Err: ~1 in 2900 | Err: ~1 in 1400 | Err: ~1 in 2349 |
| ROI Minus Primer Error Rate | MP Err: ~1 in 763 | MP Err: ~1 in 824 | MP Err: ~1 in 780 | MP Err: ~1 in 429 | MP Err: ~1 in 1525 | MP Err: ~1 in 1615 | MP Err: ~1 in 531 | MP Err: ~1 in 1769 | MP Err: ~1 in 854 | MP Err: ~1 in 1451 |
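
By way of non-limiting illustration, the ROI error rates in Table 4 are numerically consistent with the total number of ROI bases (match count plus error count) divided by the error count. This relationship is inferred from the tabulated values rather than stated in the text, and the function name and rounding rule are assumptions of this sketch.

```python
# ROI match counts and ROI error counts for spots 1-10, from Table 4.
roi_match_count = [2500, 2698, 2561, 2122, 2499, 2666, 2625, 2899, 2798, 2348]
roi_error_count = [3, 2, 2, 3, 1, 1, 3, 1, 2, 1]


def bases_per_error(matches: int, errors: int) -> int:
    """Total ROI bases per error, rounded half up to a whole base."""
    return int((matches + errors) / errors + 0.5)


roi_error_rates = [
    bases_per_error(m, e) for m, e in zip(roi_match_count, roi_error_count)
]
for spot, rate in enumerate(roi_error_rates, start=1):
    print(f"spot {spot:>2}: ~1 in {rate}")
```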

Example 4: Parallel Assembly of 29,040 Unique Polynucleotides

A structure comprising 256 clusters each comprising 121 loci on a flat silicon plate 1001 was manufactured as shown in FIG. 10. An expanded view of a cluster is shown in 1005 with 121 loci. Loci from 240 of the 256 clusters provided an attachment and support for the synthesis of polynucleotides having distinct sequences. Polynucleotide synthesis was performed by phosphoramidite chemistry using general methods from Example 3. Loci from 16 of the 256 clusters were control clusters. The global distribution of the 29,040 unique polynucleotides synthesized (240×121) is shown in FIG. 11A. Polynucleotide libraries were synthesized at high uniformity. 90% of sequences were present at signals within 4× of the mean, allowing for 100% representation. Distribution was measured for each cluster, as shown in FIG. 11B. On a global level, all polynucleotides in the run were present and 99% of the polynucleotides had abundance that was within 2× of the mean, indicating synthesis uniformity. This same observation was consistent on a per-cluster level.
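
By way of non-limiting illustration, a uniformity figure of the kind reported above can be computed as the fraction of polynucleotides whose abundance lies within a given fold of the mean. The sketch assumes that "within k× of the mean" means between mean/k and mean×k; that reading, the function name, and the example abundances are hypothetical and are not taken from the disclosure.

```python
def fraction_within_fold(abundances, fold):
    """Fraction of values x with mean/fold <= x <= mean*fold."""
    mean = sum(abundances) / len(abundances)
    lower, upper = mean / fold, mean * fold
    inside = sum(1 for x in abundances if lower <= x <= upper)
    return inside / len(abundances)


# Library size from the text: 240 clusters x 121 loci per cluster.
n_unique = 240 * 121
print(f"unique polynucleotides: {n_unique}")

# Hypothetical read counts for seven polynucleotides (illustration only).
example_abundances = [100, 120, 80, 95, 105, 30, 210]
within_2x = fraction_within_fold(example_abundances, 2)
within_4x = fraction_within_fold(example_abundances, 4)
print(f"within 2x of mean: {within_2x:.0%}")
print(f"within 4x of mean: {within_4x:.0%}")
```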

The error rate for each polynucleotide was determined using an Illumina MiSeq gene sequencer. The error rate distribution for the 29,040 unique polynucleotides averages around 1 in 500 bases, with some error rates as low as 1 in 800 bases. Distribution was measured for each cluster. The library of 29,040 unique polynucleotides was synthesized in less than 20 hours. Analysis of GC percentage versus polynucleotide representation across all of the 29,040 unique polynucleotides showed that synthesis was uniform despite GC content.

Example 6. Library Preparation with Universal Adapters

Nucleic acid samples (50 ug) were prepared comprising either dual-index adapters or universal adapters. A ligation master mix was prepared from 20 uL of ligation buffer, 10 uL of ligation mix (containing ligase), and 15 uL of water. The nucleic acid sample was combined with the ligation mix and incubated at 20 deg C. for 15 minutes. The mixture was then combined with 80 uL of magnetic DNA purification beads and vortexed, followed by 5 minutes of incubation at room temperature. The mixture was then set on a magnetic plate for 1 min. The beads were then washed with 80% ethanol, incubated for 1 min, and the ethanol wash discarded. The wash was repeated once. Then, beads were air-dried for 5-10 minutes, removed from the magnetic plate, and treated with 17 uL of water, 10 mM Tris-HCl pH 8, or buffer EB. The mixture was homogenized and incubated 2 min at room temperature. The mixture was then placed again on the magnetic plate and incubated 3 min at room temperature, followed by removal of the supernatant containing the universal adapter-ligated genomic DNA. The universal adapter-ligated genomic DNA was combined with 10 uL of barcoded primers and 25 uL of KAPA HiFi HotStart ReadyMix to attach barcodes to the universal primers. The following PCR conditions were used: 1) initialization at 98 deg C. for 45 seconds, 2) a second step comprising: a) denaturation at 98 deg C. for 15 sec, b) annealing at 60 deg C. for 30 sec, and c) extension at 72 deg C. for 30 sec; wherein the second step was repeated for 6-8 cycles, 3) final extension at 72 deg C. for 1 minute, and 4) final hold at 4 deg C. Products were purified by DNA beads in a similar manner as previously described. The amplified barcoded library was analyzed on a Qubit dsDNA broad range quantification assay instrument. This library was then sequenced directly.

Use of universal adapters resulted in increased library nucleic acid concentration after amplification relative to standard dual-index Y-adapters. The protocol utilizing universal adapters also led to higher total yields after amplification and lower adapter dimer formation. Additionally, a library prepared with universal adapters provided for lower AT dropouts compared to standard dual-index Y-adapters, and resulted in uniform representation of all index sequences. Similarly, universal adapters comprising 10 bp dual indices were utilized (8 PCR cycles, N=12). For comparison, standard full-length Y adapters were also tested for the same genomic DNA sample (10 PCR cycles, N=12).
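For bookkeeping, the cycling conditions and master mix volumes of Example 6 can be written down as data. The following is a minimal sketch only: the names `Step`, `programmed_seconds`, and `MASTER_MIX_UL` are hypothetical, and the time calculation assumes that instrument ramp times and the final 4 deg C. hold are ignored.

```python
# Illustrative sketch: temperatures and hold times are taken from Example 6.
# Ramp times and the final 4 deg C hold are ignored (assumption).
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    temp_c: int
    seconds: int


INITIAL = Step("initialization", 98, 45)
CYCLE = (
    Step("denaturation", 98, 15),
    Step("annealing", 60, 30),
    Step("extension", 72, 30),
)
FINAL = Step("final extension", 72, 60)

# Ligation master mix volume in uL: ligation buffer + ligation mix + water.
MASTER_MIX_UL = 20 + 10 + 15


def programmed_seconds(n_cycles: int) -> int:
    """Sum of programmed hold times for n_cycles of the three-step cycle."""
    if n_cycles < 1:
        raise ValueError("n_cycles must be at least 1")
    per_cycle = sum(step.seconds for step in CYCLE)
    return INITIAL.seconds + n_cycles * per_cycle + FINAL.seconds


if __name__ == "__main__":
    for n in (6, 7, 8):  # Example 6 uses 6-8 cycles
        print(n, "cycles:", programmed_seconds(n), "s programmed hold time")
```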

Example 7. Library Preparation with Universal Adapters and Enrichment

A nucleic acid sample was prepared using the general methods of Example 6, with modification: dual-index adapters were replaced with universal adapters. After ligation of universal adapters, amplification of the adapter-ligated sample nucleic acid library was conducted with a barcoded primer library, to generate a barcoded adapter-ligated sample nucleic acid library. This library was then subjected to analogous enrichment, purification, and sequencing steps. Use of universal adapters resulted in comparable or better sequencing outcomes.

Example 8. Library Preparation with Unique Molecular Identifiers

A nucleic acid sample was prepared using the general methods of Example 6, with modification: dual-index adapters comprised a 4-base length UMI, selected from a set of 48 unique sequences having a Hamming distance of at least 2. Adapters were added using the general procedure of FIGS. 1A-1B. Three different amounts of sample genomic DNA were evaluated. A mass titration plot for 5 ng, 10 ng, and 50 ng of input genomic DNA against mean target coverage is shown in FIG. 2B. The number of families was examined for different mass inputs (FIGS. 3A-3C), and the mean target coverage after deduplication increased with increased mass input. The number of families was also examined as a function of different numbers of UMI pairs (FIGS. 4A-4B, 5A-5B). Variant calling accuracy was also examined using the TruQ NGS DNA Reference Standard (Table 5). For 10% TruQ vs 1% TruQ, mass input and depth of sequence impacted the accuracy of variant calling (FIGS. 6A-6B).

TABLE 5

Chromosome   Gene     Variant   Expected Allelic   10%        1%
Number                          Frequency, %       dilution   dilution
7p12         EGFR     G719S     16.70%             1.67%      0.167%
7q34         BRAF     V600E      8.00%             0.8%       0.08%
12p12.1      KRAS     G13D      25.00%             2.5%       0.25%
3q26.        PIK3CA   H1047R    30.00%             3%         0.3%
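The dilution columns of Table 5 follow from multiplying the expected allelic frequency by the dilution fraction. A minimal sketch of that arithmetic is given below. The function names are hypothetical, and `expected_alt_reads` rests on the simplifying assumption that every read at a site samples the variant independently at its expected frequency.

```python
# Illustrative arithmetic for Table 5; function names are hypothetical.
from decimal import Decimal

# Expected allelic frequencies (percent) of the undiluted reference standard.
EXPECTED_AF_PERCENT = {
    ("EGFR", "G719S"): Decimal("16.70"),
    ("BRAF", "V600E"): Decimal("8.00"),
    ("KRAS", "G13D"): Decimal("25.00"),
    ("PIK3CA", "H1047R"): Decimal("30.00"),
}


def diluted_af(af_percent: Decimal, dilution_percent: Decimal) -> Decimal:
    """Allelic frequency (percent) after diluting the standard to dilution_percent."""
    return af_percent * dilution_percent / Decimal(100)


def expected_alt_reads(depth: int, af_percent: Decimal) -> Decimal:
    """Expected variant-supporting reads at a given depth (simplified model)."""
    return Decimal(depth) * af_percent / Decimal(100)


if __name__ == "__main__":
    for (gene, variant), af in EXPECTED_AF_PERCENT.items():
        print(gene, variant, diluted_af(af, Decimal(10)), diluted_af(af, Decimal(1)))
```

Under this model, a variant at 0.167% allelic frequency would be supported by roughly 33 reads at 20,000× depth, which gives a sense of why depth and input mass matter at the 1% dilution.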

For comparison, variant calling using all reads (no UMIs or deduplication) is shown in FIG. 7A (20,000× read depth), and FIG. 7B (40,000× read depth). Variant calls using duplex consensus reads are shown in FIG. 8A (20,000× read depth), and FIG. 8B (40,000× read depth).
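The idea of organizing reads into UMI families and collapsing them to duplex consensus reads can be sketched as follows. This is a simplified illustration and not the analysis pipeline used for the figures: the `Read` fields, the sorted-pair family key, the strict-majority vote, and the assumption that all sequences in a family are equal-length and already oriented to the reference are all hypothetical choices.

```python
# Simplified sketch of UMI family grouping and duplex consensus calling.
# All names and conventions here are assumptions made for illustration.
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    position: int    # alignment start of the fragment
    umi_pair: tuple  # (first UMI, second UMI) as observed on the read
    strand: str      # "top" or "bottom" strand of the original molecule
    sequence: str    # bases oriented to the reference, equal length per family


def family_key(read: Read) -> tuple:
    # Sorting the pair places both strands of one molecule in the same family.
    return (read.position, tuple(sorted(read.umi_pair)))


def majority_consensus(seqs: list) -> str:
    """Per-column strict-majority base; columns without a majority become N."""
    consensus = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count * 2 > len(column) else "N")
    return "".join(consensus)


def duplex_consensus(reads: list) -> dict:
    """Return one consensus per family that has reads from both strands."""
    families = defaultdict(lambda: defaultdict(list))
    for read in reads:
        families[family_key(read)][read.strand].append(read.sequence)
    result = {}
    for key, strands in families.items():
        if "top" not in strands or "bottom" not in strands:
            continue  # single-strand family: no duplex support
        top = majority_consensus(strands["top"])
        bottom = majority_consensus(strands["bottom"])
        # Keep a base only where the two strand consensuses agree.
        result[key] = "".join(a if a == b else "N" for a, b in zip(top, bottom))
    return result
```

In this sketch, an error introduced during amplification tends to appear in only some reads of one strand and is voted out, whereas a true variant is present on both strands of the original molecule and survives the duplex step.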

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Example 9. Unique Molecular Identifier Design

A UMI set was designed with the sequences of Table 6.

TABLE 6

UMI     UMI        Reverse Complement    Length
Index   Sequence   of UMI Sequence       (bp)
1       AAGGA      TCCTT                 5
2       ACAAC      GTTGT                 5
3       ATACG      CGTAT                 5
4       CACTG      CAGTG                 5
5       CATGA      TCATG                 5
6       CGATA      TATCG                 5
7       CGTGT      ACACG                 5
8       GCCAT      ATGGC                 5
9       GCTGT      ACAGC                 5
10      GTCAC      GTGAC                 5
11      GTCGT      ACGAC                 5
12      TACGA      TCGTA                 5
13      TCCTA      TAGGA                 5
14      TCGTG      CACGA                 5
15      TGTCG      CGACA                 5
16      TTGGC      GCCAA                 5
17      AACACA     TGTGTT                6
18      AATGCC     GGCATT                6
19      ACTAGG     CCTAGT                6
20      AGCATC     GATGCT                6
21      AGTACA     TGTACT                6
22      ATCTCC     GGAGAT                6
23      CAGACG     CGTCTG                6
24      CAGTAC     GTACTG                6
25      CGAATC     GATTCG                6
26      CGGTTG     CAACCG                6
27      CTTGGA     TCCAAG                6
28      GCATAG     CTATGC                6
29      GCTAAC     GTTAGC                6
30      GTGAGA     TCTCAC                6
31      GTGTCA     TGACAC                6
32      TGTGCC     GGCACA                6
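A UMI set such as that of Table 6 can be sanity-checked computationally. The sketch below recomputes reverse complements and reports the minimum pairwise Hamming distance among equal-length sequences. The helper names are hypothetical, only the first four rows of Table 6 are included for brevity, and the Hamming distance check is offered as a generic design check rather than a stated criterion for this particular set.

```python
# Illustrative checks for a UMI set; helper names are hypothetical.
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(seqs: list) -> int:
    """Smallest Hamming distance over all pairs (requires at least two sequences)."""
    return min(hamming(a, b) for a, b in combinations(seqs, 2))


# First four rows of Table 6: UMI sequence -> listed reverse complement.
TABLE6_ROWS = {
    "AAGGA": "TCCTT",
    "ACAAC": "GTTGT",
    "ATACG": "CGTAT",
    "CACTG": "CAGTG",
}

if __name__ == "__main__":
    for umi, listed in TABLE6_ROWS.items():
        print(umi, listed, reverse_complement(umi) == listed)
    print("min pairwise Hamming distance:", min_pairwise_hamming(list(TABLE6_ROWS)))
```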

UMIs were incorporated into adapters using the general design shown in FIG. 16. After the adapters were ligated to a genomic sample, the abundance of each UMI was measured using sequencing (FIG. 17). Next, UMI performance was evaluated by measuring single base substitution recall of SNVs in (LB) and (EM) samples (FIG. 19A). Performance was also measured using a pan-cancer control sample (FIG. 19B). Exemplary adapter-ligated gDNA library fragments are shown in FIGS. 20A-20B.
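Measuring UMI abundance from sequencing data amounts to counting how often each UMI is observed and normalizing. The following is a minimal sketch under stated assumptions: the UMI is taken to sit at the very start of each read (the actual position depends on the adapter design of FIG. 16, which is not reproduced here), longer UMIs are tried first, and the function names and the default 1-5% window are illustrative.

```python
# Illustrative UMI abundance tally; read layout and names are assumptions.
from collections import Counter


def umi_fractions(reads: list, umis: list) -> dict:
    """Fraction of UMI-matched reads carrying each UMI as a read prefix.

    Reads matching no UMI are ignored. Longer UMIs are tried first so that
    a 6-base UMI is not mistaken for a 5-base one.
    """
    by_length = sorted(umis, key=len, reverse=True)
    counts = Counter()
    for read in reads:
        for umi in by_length:
            if read.startswith(umi):
                counts[umi] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return {}
    return {umi: counts[umi] / total for umi in umis}


def outside_window(fractions: dict, low: float = 0.01, high: float = 0.05) -> list:
    """UMIs whose observed fraction falls outside the [low, high] window."""
    return [umi for umi, f in fractions.items() if not (low <= f <= high)]
```

With 32 UMIs at perfectly even representation, each would account for 1/32 (about 3.1%) of matched reads, so a window such as 1-5% flags UMIs that are noticeably over- or under-represented.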

1. A method of sequencing comprising: ligating one or more polynucleotide adapters to a plurality of sample nucleic acids to generate a library of adapter-ligated sample polynucleotides, wherein at least some of the polynucleotide adapters comprise a first unique molecular identifier and a second unique molecular identifier; amplifying the library; sequencing the enriched library to generate a plurality of reads; organizing the reads based on the first unique molecular identifier and the second unique molecular identifier to distinguish between amplification errors and single nucleotide polymorphisms present in the sample nucleic acids, and wherein at least 80% of SNV variants are called at a level of at least 1%.
2. The method of claim 1, wherein the SNV variants are called with a minimum sequencing depth of 10,000×.
3. The method of claim 1, wherein at least 90% of SNV variants are called at a level of at least 1%.
4-5. (canceled)
6. The method of claim 1, wherein at least 95% of SNV variants are called at a level of at least 1% with a minimum sequencing depth of 10,000×.
7. (canceled)
8. The method of claim 1, wherein the first unique molecular identifier and the second unique molecular identifier are selected from a set of no more than 64 sequences.
9-10. (canceled)
11. The method of claim 1, wherein the duplex efficiency comprises the number of duplex reads after sequencing after the duplex collapses divided by the total number of input reads.
12. The method of claim 11, wherein the polynucleotide adapter comprises a duplex efficiency of at least 4%.
13. (canceled)
14. The method of claim 11, wherein the method has a recall of at least 20% for single nucleotide polymorphisms present at least at 0.2% abundance in the sample nucleic acids.
15. (canceled)
16. A library of polynucleotide adapters comprising: at least two polynucleotide adapters, each comprising: a first strand, wherein the first strand comprises a first terminal adapter region, a first non-complementary region, a first yoke region, and a first unique molecular identifier; and a second strand, wherein the second strand comprises a second terminal adapter region, a second non-complementary region, a second yoke region, and a second unique molecular identifier; wherein the first yoke region and the second yoke region are complementary, wherein the first non-complementary region and the second non-complementary region are not complementary, and wherein the fraction of each of the polynucleotides in the library is 1-5%.
17. The library of claim 16, wherein the fraction of each of the polynucleotides in the library is 1.5-4.5%.
18. The library of claim 16, wherein the library comprises at least 8 different unique molecular identifiers.
19-25. (canceled)
26. The library of claim 16, wherein the first unique molecular identifier and the second unique molecular identifier are 5 or 6 bases in length.
27. The library of claim 16, wherein the first unique molecular identifier and the second unique molecular identifier are complementary.
28. (canceled)
29. The library of claim 16, wherein the first unique molecular identifier or the second unique molecular identifier comprise the sequences of one or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, TTGGC, AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC.
30. The library of claim 16, wherein the first unique molecular identifier or the second unique molecular identifier comprise the sequences of 10 or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, TTGGC, AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC.
31. The library of claim 16, wherein the first unique molecular identifier or the second unique molecular identifier comprise the sequences of one or more of AAGGA, ACAAC, ATACG, CACTG, CATGA, CGATA, CGTGT, GCCAT, GCTGT, GTCAC, GTCGT, TACGA, TCCTA, TCGTG, TGTCG, and TTGGC.
32. The library of claim 16, wherein the first unique molecular identifier or the second unique molecular identifier comprise the sequences of one or more of AACACA, AATGCC, ACTAGG, AGCATC, AGTACA, ATCTCC, CAGACG, CAGTAC, CGAATC, CGGTTG, CTTGGA, GCATAG, GCTAAC, GTGAGA, GTGTCA, and TGTGCC.
33. A polynucleotide adapter of claim 16, further comprising a sample nucleic acid.
34. (canceled)
35. The polynucleotide adapter of claim 33, wherein the sample nucleic acid is genomic DNA.
36. (canceled)
37. The polynucleotide adapter of claim 16, wherein the first strand or the second strand further comprises at least one barcode.
38-40. (canceled)